Preface


Applied Numerical Analysis is written for sophomores and juniors in engineering, science, 
mathematics, and computer science. It is also valuable as a sourcebook for practicing 
engineers and scientists who need to use numerical procedures to solve problems. We 
have been gratified to find that many who purchased this book as students keep it as a 
permanent part of their reference library because its readability and breadth allow them 
to expand their knowledge of the subject by self-study.

While it is assumed that the reader has a good knowledge of calculus, appropriate 
topics are reviewed in the context of their use, and an appendix gives a summary of the 
most important items that are used to develop and analyze numerical procedures. The
mathematical notation is purposely kept simple for clarity. 


PURPOSE 


The purpose of this fourth edition is the same as in previous editions: to give a broad 
coverage of the field of numerical analysis, emphasizing its practical applications rather 
than theory. At the same time, methods are compared, errors are analyzed, and relationships
to the fundamental mathematical basis for the procedures are presented so that a
true understanding of the subject is attained. Clarity of exposition, development through
examples, and logical arrangement of topics help the student become more and more
adept at applying the methods.


CONTENT FEATURES 


Applied Numerical Analysis has enjoyed significant success because of several outstanding 
features. These are retained and amplified in this fourth edition: 
* The unusually large number of exercises allows the instructor to select those that are
appropriate for the background and interests of the students. These are keyed to the
corresponding section of the chapter to assist in this. When the reader is using the book
for self-study, the many exercises are an important supplement to the text. In addition
to the practice exercises, each chapter has “Applied Problems and Projects” that are
more challenging and that illustrate many fields of application of the various numerical
procedures.

* By solving the same problem with several different methods, the relative efficiency and
effectiveness of the methods become clear (pp. 11, 12, 13, and 27; pp. 98 and 111;
pp. 350, 353, 358, 362, 365, and 368).

* A short summary of its contents is given at the beginning of each chapter to provide
a preview of what is coming (pp. 1, 86, 180, 264, 347, and so on).

* Most chapters are introduced with an easily understood example that applies the material
of the chapter. This motivates the student and shows that the material is of real utility
(pp. 2, 87, 181, 349, 472, and so forth).

* Each chapter ends with a summary that reminds the student of what has been covered
and suggests that appropriate review be done to ensure that nothing has been overlooked
or not learned (pp. 49, 147, 235, 321, and so on).

* The coverage of partial-differential equations in an easily understood manner is unusual
in a book at this level.

* Postponing the treatment of computer arithmetic and errors until after the student has
been exposed to some numerical methods puts this important topic into proper context
and helps in the appreciation of its significance as one factor in the accuracy of the
computed result (p. 38).

* Computer programs in FORTRAN are given at the conclusion of each chapter. These
implement the more important algorithms and serve as easily understood examples of
how the computer can be used to carry out the computations. They do not pretend to
be at a professional level of programming, because their purpose is illustrative and
clarity would otherwise be sacrificed. We recognize that many instructors prefer a more
structured language, but the presence of many subroutines in FORTRAN dictates our
choice. We do provide a Pascal version of the programs in the supplements to the text;
both versions are provided on diskettes for ease of entering into the local computer
system.


FEATURES OF THIS EDITION 


We have benefited by suggestions from those who have used the book in its previous
editions. Significant new material has been added, and some chapters have been improved
through rearrangement or rewriting. The format of the book has been improved. The use
of a second color highlights important items and adds interest and variety to the text.
Important revisions include the following: 


* The section on Muller's method in Chapter 1 has been moved up, and the difference
between this and methods based on a linear approximation to the function has been
emphasized (p. 18).

* Sections on computer arithmetic and errors have been rewritten and expanded (pp.
38-46).

* The use of matrix algebra has been broadened throughout the book. This is especially
apparent in the discussion of iterative methods in Chapters 2 and 6.

* The relation between Gauss, Gauss-Jordan, and other LU methods has been clarified
(p. 106).

* The chapter on interpolation has been streamlined and the use of divided differences
has been emphasized.

* Other spline-based methods have been added and their use in approximating surfaces
has been discussed (pp. 217 and 233).

* The section on adaptive integration has been improved (p. 305).

* We have given more modern versions of higher-order Runge-Kutta methods (p. 358).

* A most significant addition has been an elementary treatment of the finite-element
method. This is introduced by an explanation of variational methods in one-dimensional
boundary-value problems (p. 426). Finite elements are then applied to elliptic equations
(p. 512) and to parabolic equations (p. 573).

* A discussion of the QR method for finding eigenvalues is included in Chapter 6.

* The section on fast Fourier transforms has been rewritten and expanded (p. 652).

* More of the programs at the end of each chapter have sample output for the reader to
examine. Moreover, we have included new programs that implement adaptive integration,
divided differences, the Runge-Kutta-Fehlberg method, and the QR technique.
We have replaced a program in Chapter 2 with one that implements the LU decomposition
method for Gaussian elimination.

* The appendix that describes software packages has been updated to cover some newer
software, especially that for personal computers.


PEDAGOGICAL FEATURES 

We recognize that the student is the most important part of the teaching/learning system 
and have tried to facilitate his or her understanding by several items, some of which are 
new in this edition: 


* Sections that preview the material lead off each chapter.

* A chapter summary at the end of each chapter reminds the student of what should have
been learned.

* The second color makes it easier to recognize the more important equations and algorithms.
Color in the illustrations adds interest and makes their message clearer.

* The sections that have been rewritten eliminate some items that may have caused
confusion for some students.

* The bibliography has been updated, and selected references have been given at the end
of each chapter. This makes it easy for the student to find other treatments of the
material, some of which go into more detail than we are able to.
SUPPLEMENTS


A number of supplements are available to assist the instructor: 


* A Solutions Manual gives the answers to nearly all the exercises and to some of the
“Applied Problems and Projects.” Hints for other problems or projects are also provided.

* Copies of the programs, both in the FORTRAN version that is printed in the text and
in an equivalent Pascal version, are available to adopters. These are on diskettes in
IBM PC-compatible form so that they can readily be entered into the local computer
system if the instructor wishes to make them available to the student on-line. Alternatively,
copies can be provided to students when personal computers are to be used.

* The Solutions Manual gives suggestions for how the instructor can select from the text
when he or she does not have time to cover all of it. Since our coverage of topics in
numerical analysis is unusually broad, such selection is frequently necessary.
Solving Nonlinear Equations


1.0 CONTENTS OF THIS CHAPTER 


Chapter 1 introduces you to numerical analysis by explaining how methods of
successive approximations can find the solution of a single nonlinear equation.
It explains how computer programs can be written to obtain such solutions and
gives numerous examples that implement the methods of the chapter.

1.1 THE LADDER IN THE MINE
Illustrates how the methods of the chapter are applied to a realistic problem.
1.2 METHOD OF HALVING THE INTERVAL
Describes an ancient and simple method.
1.3 METHOD OF LINEAR INTERPOLATION
Presents two improvements on halving of intervals; it solves f(x) = 0 by using a
secant line.
1.4 NEWTON'S METHOD
Is a widely used and rapidly convergent method that uses a tangent line to find a
root.
1.5 MULLER'S METHOD
Uses a parabola to approximate the function.
1.6 USE OF x = g(x) METHOD
Departs from the previous techniques and establishes some theory that is used in
analyzing the various methods and in setting up for accelerating convergence.
1.7 CONVERGENCE OF NEWTON'S METHOD
Typifies the analysis of the convergence properties of a numerical method.
1.8 METHODS FOR POLYNOMIALS
Develops Newton's method as applied to this important class of functions.
1.9 BAIRSTOW'S METHOD FOR QUADRATIC FACTORS
Is a specialized method that gets the roots of polynomials two at a time.
1.10 OTHER METHODS FOR POLYNOMIALS
Describes the QD algorithm and Graeffe's method.
1.11 COMPUTER ARITHMETIC AND ERRORS
Describes the various types of errors, including some due to the computer.
1.12 FLOATING-POINT ARITHMETIC AND ERROR ESTIMATES
Deals with the way that numbers are stored in computers.
1.13 PROGRAMMING FOR NUMERICAL SOLUTIONS
Shows, through an example, several things of importance in writing computer
programs.
1.14 CHAPTER SUMMARY
Gives you a checklist against which you can measure your understanding of the
topics of this chapter. Obviously, you should restudy those sections where your
comprehension is not complete.
1.15 COMPUTER PROGRAMS
Illustrates the process of program development and gives programs that implement
the methods of the chapter.


1.1 THE LADDER IN THE MINE


In this book we will begin most chapters with an example that illustrates the application 
of the numerical techniques covered in the chapter. We will frame these in the context 
of the real world but simplified. This first example is typical and defines the problem, 
describes how it can be solved, and ends by pointing out how numerical methods are 
useful in getting the solution. 

It is not uncommon, in applied mathematics, to have to solve a nonlinear equation.
If you worked for a mining company, the following might be a typical problem.


There are two intersecting mine shafts that meet at an angle of 123°, as shown in Fig. 
1.1. The straight shaft has a width of 7 ft, while the entrance shaft is 9 ft wide. What 
is the longest ladder that can negotiate the turn? You can neglect the thickness of the 
ladder members and assume it is not tipped as it is maneuvered around the corner. Your 
solution should provide for the general case in which the angle A is a variable, as well 
as the widths of the shafts. 
Figure 1.1


Whenever a scientific or engineering problem is solved, there are four general steps 
to be followed: 


1. State the problem clearly, including any simplifying assumptions.

2. Develop a mathematical statement of the problem in a form that can be solved
for a numerical answer. This may involve, as in the present case, the use of
calculus. In other situations, other mathematical procedures may be employed.

3. Solve the equation(s) that result from step 2. Sometimes this is a method from
algebra, but frequently more advanced methods will be needed. The subject
matter of this text is numerical procedures that are quite powerful and of general
applicability. The result of this step is a numerical answer or set of answers.

4. Interpret the numerical result to arrive at a decision. This will require experience
and an understanding of the situation in which the problem was embedded. This
interpretation is the hardest part of solving problems and must be learned on the
job. This book will emphasize step 3 and will deal to some extent with steps 1
and 2, but step 4 cannot be meaningfully treated in the classroom.


The above description of the problem has taken care of step 1. Now for step 2: 


Here is one way to analyze our ladder problem. Visualize the ladder in successive 
locations as we carry it around the corner; there will be a critical position in which the 
two ends of the ladder touch the walls while a point along the ladder touches the corner 
where the two shafts intersect. (See Fig. 1.2.) Let C be the angle between the ladder and 
the wall when in this critical position. It is usually preferable to solve problems in general 
terms, so we work with variables C, A, B, $w_1$, and $w_2$.

Consider a series of lines drawn in this critical position—their lengths vary with the 
angle C, and the following relations hold (angles are expressed in radian measure): 


$$\ell_1 = \frac{w_2}{\sin B}, \qquad \ell_2 = \frac{w_1}{\sin C},$$

$$B = \pi - A - C,$$

$$\ell = \ell_1 + \ell_2 = \frac{w_2}{\sin(\pi - A - C)} + \frac{w_1}{\sin C}.$$
Figure 1.2
The maximum length of ladder that can negotiate the turn is the minimum of $\ell$ as a
function of angle C. We hence set $d\ell/dC = 0$:

$$\frac{d\ell}{dC} = \frac{w_2 \cos(\pi - A - C)}{\sin^2(\pi - A - C)} - \frac{w_1 \cos C}{\sin^2 C} = 0.$$

We can solve the general problem if we can find the value of C that satisfies this
equation. With the critical angle determined, the ladder length is given by

$$\ell = \frac{w_2}{\sin(\pi - A - C)} + \frac{w_1}{\sin C}.$$


As this analysis shows, to solve the specific problem we must solve a transcendental
equation for the value of C:

$$\frac{9 \cos(\pi - 2.147 - C)}{\sin^2(\pi - 2.147 - C)} - \frac{7 \cos C}{\sin^2 C} = 0,$$

and then substitute C into

$$\ell = \frac{9}{\sin(\pi - 2.147 - C)} + \frac{7}{\sin C},$$

where we have converted 123° into 2.147 radians.

Finding the solution to an algebraic or transcendental equation, as we must do here, 
is the topic of this first chapter. 
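As a preview of step 3, the specific ladder equation can be handed to a short program. The following is only a sketch, written in Python rather than the FORTRAN used elsewhere in the book. The function names `ladder_slope`, `ladder_length`, and `critical_angle` are illustrative and do not come from the text, and the root is located by the interval-halving idea of Section 1.2, on the assumption that $d\ell/dC$ changes sign exactly once between C = 0 and C = π - A.

```python
import math

def ladder_slope(C, A, w1, w2):
    """d(ell)/dC for the ladder problem; its root in C is the critical angle."""
    B = math.pi - A - C
    return w2 * math.cos(B) / math.sin(B) ** 2 - w1 * math.cos(C) / math.sin(C) ** 2

def ladder_length(C, A, w1, w2):
    """Length of the line that touches both outer walls and the inner corner."""
    return w2 / math.sin(math.pi - A - C) + w1 / math.sin(C)

def critical_angle(A, w1, w2, tol=1e-10):
    """Find the root of ladder_slope on (0, pi - A) by repeated halving."""
    lo = 1e-6                      # slope is large and negative near C = 0
    hi = (math.pi - A) - 1e-6      # slope is large and positive near C = pi - A
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ladder_slope(mid, A, w1, w2) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

A = 2.147                          # 123 degrees in radians, as in the text
C = critical_angle(A, w1=7.0, w2=9.0)
print("critical angle C (radians):", C)
print("longest ladder (ft):", ladder_length(C, A, 7.0, 9.0))
```

Because the widths and the angle A are arguments, the same functions cover the general case that the problem statement asks for.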

In this chapter we study methods to find the roots of an equation, such as in our 
ladder-in-the-mine example. Much of algebra is devoted to the “solution of equations.” 
In simple situations, this consists of a rearrangement to exhibit the value of the unknown 
variable as a simple arithmetic combination of the constants of the equation. For second- 
degree polynomials, this can be expressed by the familiar quadratic formula. For third- 
and fourth-degree polynomials, formulas exist but are so complex as to be rarely used; 
for higher-degree equations it has been proved that finding the solution through a formula 
is impossible. Most transcendental equations (involving trigonometric or exponential 
functions) are likewise intractable. 
Even though it is difficult if not impossible to exhibit the solution of such equations
in explicit form, numerical analysis provides a means by which a solution may be found,
or at least approximated as closely as desired. Many of these numerical procedures follow
a scheme that may be thought of as providing a series of successive approximations, each
more precise than the previous one, so that enough repetitions of the procedure eventually
give an approximation that differs from the true value by less than some arbitrary error
tolerance. Numerical procedures are thus seen to resemble the limit concept of mathematical
analysis.

When a numerical solution that satisfies the transcendental equation above has been 
obtained, we have completed step 3 of the general procedure. The rest of this chapter 
treats several methods for doing this. We will not do step 4, but in this case it would 
consist of deciding if the maximum-length ladder that can be carried into the mine is 
long enough. If it is not, a decision must be made about the remedy. Perhaps this would 
be to use an extension ladder or to cut a notch in the corner of the wall. 

This example shows how important calculus can be in solving practical problems. 
You will also find that calculus is a critical component in the analysis of numerical 
methods. Appendix A provides a summary of some of the most important elements of 
calculus. Look this over now to see if there are items that you should review. 


1.2 METHOD OF HALVING THE INTERVAL


This first chapter describes methods for solving equations; that is, given an equation of 
the form f(x) = 0, what value(s) of x satisfy the equation? There are several obvious 
possibilities that we will not cover, such as trial and error (trying various values of x 
until we discover one that works) and drawing a graph of values of f(x) versus x-values, 
seeing where the plot crosses the x-axis. Our methods will be more systematic than these, 
although a rough graph is frequently helpful in understanding the nature of the function 
and approximately where the function has roots. 

The first numerical procedure that we will study is that of interval halving.* Consider 
the cubic 

$$f(x) = x^3 + x^2 - 3x - 3 = 0.$$

At x = 1, f has the value -4. At x = 2, f has the value +3. Since the function is
continuous, the change in sign of the function between x = 1 and x = 2 guarantees
at least one root on the interval (1, 2). (See Fig. 1.3.)

Suppose we now evaluate the function at x = 1.5 and compare the result to the
function values at x = 1 and x = 2. Since the function changes sign between x = 1.5
and x = 2, a root lies between these values. We can obviously continue this interval
halving to determine a smaller and smaller interval within which a root must lie. For this
example, continuing the process leads eventually to an approximation to the root at
x = √3 = 1.7320508075. . . . The process is illustrated in Fig. 1.4.


*The method, also known as the Bolzano method, is of ancient origin. Some authors call it the bisection method.


Figure 1.3 


Figure 1.4 




While a graphic method, as illustrated in Fig. 1.4, may be suitable if we want only
an approximate answer, to obtain more accuracy we need to write a rule to do it mathematically.
We should also express our algorithm (the technical name for a systematic
procedure) in a way that makes it easy to implement the method with a computer program. 
We shall adopt a style of expressing algorithms that emphasizes the orderly structure. 


Method of Halving the Interval (Bisection Method)


To determine a root of f(x) = 0 that is accurate within a specified tolerance
value, given values of x1 and x2 such that f(x1) * f(x2) < 0,


REPEAT
    Set x3 = (x1 + x2)/2.
    IF f(x3) * f(x1) < 0:
        Set x2 = x3.
    ELSE Set x1 = x3.
    ENDIF.
UNTIL (|x1 - x2| < tolerance value) OR f(x3) = 0.


The final value of x3 approximates the root; it is in error by not more than
(1/2)|x1 - x2|.


Note. The method may give a false root if f(x) is discontinuous on [x1, x2].
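The boxed algorithm translates almost line for line into a program. The sketch below is one possible rendering in Python; the function name `bisect`, the `max_iter` safeguard, and the choice to return the midpoint of the final interval together with its half-width are assumptions of this sketch, not part of the algorithm as stated.

```python
import math

def bisect(f, x1, x2, tol=1e-6, max_iter=200):
    """Halve [x1, x2] until it is shorter than tol.

    Returns (estimate, bound), where bound is half the final interval width.
    """
    f1 = f(x1)
    if f1 * f(x2) >= 0.0:
        raise ValueError("f(x1) and f(x2) must have opposite signs")
    for _ in range(max_iter):
        x3 = 0.5 * (x1 + x2)
        f3 = f(x3)
        if f3 == 0.0:              # landed exactly on a root
            return x3, 0.0
        if f3 * f1 < 0.0:          # sign change lies in [x1, x3]
            x2 = x3
        else:                      # sign change lies in [x3, x2]
            x1, f1 = x3, f3
        if abs(x1 - x2) < tol:
            break
    return 0.5 * (x1 + x2), 0.5 * abs(x1 - x2)

def cubic(x):
    return x**3 + x**2 - 3.0 * x - 3.0

root, bound = bisect(cubic, 1.0, 2.0, tol=1e-8)
print(root, bound)

# The same routine applies to a transcendental equation such as e^x - 3x = 0.
root2, bound2 = bisect(lambda x: math.exp(x) - 3.0 * x, 1.0, 2.0)
print(root2, bound2)
```

The sign test mirrors the IF statement of the algorithm: when f(x3) and f(x1) differ in sign, the root is bracketed by x1 and x3, so x2 is replaced.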


Applying the method to $f(x) = x^3 + x^2 - 3x - 3 = 0$, we get the results of Table
1.1. The repetition of our algorithm is called iteration, and the successive approximations
are termed the iterates.


Table 1.1 Method of halving the interval for $f(x) = x^3 + x^2 - 3x - 3 = 0$

Iteration                                                                      Maximum
number     x1        x2        x3        f(x1)       f(x2)      f(x3)         error in x3
1          1         2         1.5       -4.0        3.0        -1.875        0.5
2          1.5       2         1.75      -1.875      3.0        0.17187       0.25
3          1.5       1.75      1.625     -1.875      0.17187    -0.94335      0.125
4          1.625     1.75      1.6875    -0.94335    0.17187    -0.40942      0.0625
5          1.6875    1.75      1.71875   -0.40942    0.17187    -0.12478      0.03125
6          1.71875   1.75      1.73437   -0.12478    0.17187    0.02198       0.015625*
7          1.71875   1.73437   1.72656                                        0.0078125
∞                              1.73205                          -0.00000

*Actual error in x3 after five iterations is 0.01330.
In this first chapter we expect you to do most of your calculations using a hand
calculator.* Later in this chapter we will discuss computer programs that carry out the
computations. Please verify some of the values in Table 1.1 now.

The entries in Table 1.1 indicate the necessity of representing values of the argument 
x as well as of the function f(x) only approximately when we carry a limited number of 
decimal figures. In floating-point operations on digital computers, there is a similar 
inaccuracy in our work because computers retain only a limited number of significant 
digits. Accuracy of numbers in computers is discussed in Section 1.12. Note that this is 
true in all computations, not just in numerical methods. We will give attention to such 
“round-off errors” later. The distinction between numerical methods and numerical analysis
is that the latter term implies the consideration of errors in the procedure used.
Certainly the blind use of any calculation method without concern for its accuracy is
foolish.

Whether one rounds to the nearest fractional value or chops off the extra digits will
make a difference in the effect of the round-off errors. In Table 1.1, the figures have been
chopped after five places, which is similar to the action of many digital computers.
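The difference between chopping and rounding can be seen on a single entry of Table 1.1. At x = 1.75 the cubic evaluates to exactly 0.171875, which the table lists as 0.17187. The sketch below uses Python's standard `decimal` module to compare the two conventions; the helper names `chop` and `round_nearest` are illustrative only.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def chop(value, places=5):
    """Discard every digit beyond the given number of decimal places."""
    step = Decimal(10) ** -places
    return Decimal(value).quantize(step, rounding=ROUND_DOWN)

def round_nearest(value, places=5):
    """Round to the nearest value with the given number of decimal places."""
    step = Decimal(10) ** -places
    return Decimal(value).quantize(step, rounding=ROUND_HALF_UP)

# Values are passed as strings so that Decimal holds them exactly.
print(chop("0.171875"), round_nearest("0.171875"))          # f(1.75)
print(chop("-0.943359375"), round_nearest("-0.943359375"))  # f(1.625)
```

The chopped values agree with the entries printed in Table 1.1, while the rounded values differ from them in the last place.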

In addition to the limitation on accuracy because we retain only a limited number of 
figures in our work, there is an obvious limitation if we terminate the procedure itself 
too soon. One important advantage of the interval-halving method, beyond its simplicity, 
is our knowledge of the accuracy of the current approximation to the root. Since a root 
must lie between the x-values where the function changes sign,† the error in the last
approximation can be no more than one-half the last interval of which it is the midpoint.
This interval is known exactly, since the original difference, |x1 - x2|, is halved at each
iteration. For other methods, determining the error is more difficult.
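Since the error bound is halved at each step, the number of iterations needed for a given tolerance can be computed before any function evaluation is made: after n halvings the bound is |x1 - x2| / 2^n. The short sketch below works this out; the function name `halvings_needed` is an assumption of the sketch.

```python
import math

def halvings_needed(x1, x2, tol):
    """Smallest n for which |x2 - x1| / 2**n <= tol."""
    return max(0, math.ceil(math.log2(abs(x2 - x1) / tol)))

# Starting from the interval (1, 2) used in Table 1.1:
print(halvings_needed(1.0, 2.0, 0.0078125))   # bound reached at iteration 7
print(halvings_needed(1.0, 2.0, 1e-6))
```

For the starting interval (1, 2) the bound 0.0078125 corresponds to the seventh row of Table 1.1.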


The relative error is often the better measure of accuracy for very large or very small
values. Sometimes the accuracy is expressed as the number of digits that are
correct; in other cases, the number of correct digits after the decimal point is used. When 
the true value is unknown, it is impossible to express the accuracy with exactness, and 
approximate accuracy must be specified. 

The method of halving the interval applies equally well to transcendental equations, 
as do the other methods of this chapter. Table 1.2 shows the results when we apply the 
method to $f(x) = e^x - 3x = 0$, which has a root between x = 1 and x = 2.

The method of interval halving requires that starting values be obtained before the 
method can begin. This is true of most methods for root finding. Getting these starting 
values can be done by making a rough graph, by trial calculations, or by writing a search 
program on a computer or programmable calculator. Perhaps the best way is through 
interactive graphics, letting the computer draw curves at the direction of the user and 
varying the parameters at the console to find approximate values of roots. 


*There are several good reasons for this. First, we want you to concentrate your attention on the algorithms, 
and if you were struggling to get a program to work, it might divert your attention. Second, you will get a 
better “feel” for what is happening if you are deeply involved in the successive steps of the computations. 
Finally, not every reader of the book is already an expert at programming. Postponing the use of programs can 
help you get up to speed in your programming at the same time that you start on what we think is a fascinating 
subject—numerical analysis. 

†Observe that, if the function is discontinuous, f(x) may change sign without having a root in the interval.
Unknown functions should be examined for continuity before you attempt to evaluate their roots. 
Table 1.2 Halving the interval for $f(x) = e^x - 3x = 0$

Iteration                                                                   Maximum
number     x1     x2        x3        f(x1)      f(x2)      f(x3)       error in x3
1          1.0    2.0       1.5       -0.28172   1.38906    -0.01831    0.5
2          1.5    2.0       1.75      -0.01831   1.38906    0.50460     0.25
3          1.5    1.75      1.625     -0.01831   0.50460    0.20342     0.125
4          1.5    1.625     1.5625    -0.01831   0.20342    0.08323     0.0625
5          1.5    1.5625    1.53125   -0.01831   0.08323    0.03020     0.03125
6          1.5    1.53125   1.51562   -0.01831   0.03020    0.00539     0.015625*
∞                           1.51213

*Actual error in x3 after five iterations is -0.01912.


1.3 METHOD OF LINEAR INTERPOLATION


While the interval-halving method is easy and has simple error analysis, it is not very 
efficient. For most functions, we can improve the rate at which we converge to the root. 
One such method is the method of linear interpolation.* Suppose we assume that the 
function is linear over the interval (x1, x2), where f(x1) and f(x2) are of opposite sign.
From the obvious similar triangles in Fig. 1.5 we can write†

$$\frac{x_2 - x_3}{x_2 - x_1} = \frac{f(x_2)}{f(x_2) - f(x_1)},$$

$$x_3 = x_2 - \frac{f(x_2)}{f(x_2) - f(x_1)}\,(x_2 - x_1).$$


We then compute f(x3) and again interpolate linearly between the values at which
the function changes sign, giving a new value for x3. Repetition of this will give improving
estimates of the root. Table 1.3 shows the results of this method for the same polynomial
discussed in Section 1.2. The method appears to be somewhat faster than the method of
halving the interval, giving about the same accuracy after three steps as was obtained
there in seven. It is intuitively obvious that the speed with which the successive approximations
approach the zero of the function will depend on the degree to which the function
departs from a straight line in the interval of consideration. In other words, the rate of
convergence will be related to the rate of change of the slope of the curve, which is
measured by the magnitude of the second derivative.


*This is also known as the method of false position, and by the Latinized version regula falsi. It is also a very
old method.

†Note that, since [f(x2) - f(x1)]/(x2 - x1) is the slope of the secant line, which approximates the slope of the
function in the neighborhood of the root, the equation can be considered to be x3 = x2 - f(x2)/(slope of
function). Compare to Newton's method, in the next section.


Figure 1.5

An algorithmic statement of this method is shown below. 


Method of Linear Interpolation (Regula Falsi Method)


To determine a root of f(x) = 0, given values of x1 and x2 such that f(x1) and
f(x2) are of opposite sign,


REPEAT
    Set x3 = x2 - f(x2) * (x2 - x1) / (f(x2) - f(x1)).
    IF f(x3) of opposite sign to f(x1):
        Set x2 = x3.
    ELSE Set x1 = x3.
    ENDIF.
UNTIL |f(x3)| < tolerance value.

Note. The method may give a false root if f(x) is discontinuous on [x1, x2].
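A possible program for the method of linear interpolation follows. It is a sketch in Python under the same conventions as the boxed algorithm; the function name `regula_falsi`, the `max_iter` limit, and the returned list of iterates are additions made for illustration.

```python
import math

def regula_falsi(f, x1, x2, tol=1e-6, max_iter=100):
    """Linear interpolation between bracketing points; returns (x3, iterates)."""
    f1, f2 = f(x1), f(x2)
    if f1 * f2 >= 0.0:
        raise ValueError("f(x1) and f(x2) must have opposite signs")
    iterates = []
    x3 = x2
    for _ in range(max_iter):
        # Intersection of the secant line through (x1, f1), (x2, f2) with the x-axis.
        x3 = x2 - f2 * (x2 - x1) / (f2 - f1)
        f3 = f(x3)
        iterates.append(x3)
        if abs(f3) < tol:
            break
        if f3 * f1 < 0.0:          # root lies between x1 and x3
            x2, f2 = x3, f3
        else:                      # root lies between x3 and x2
            x1, f1 = x3, f3
    return x3, iterates

def cubic(x):
    return x**3 + x**2 - 3.0 * x - 3.0

root, iterates = regula_falsi(cubic, 1.0, 2.0)
print(root)
print(iterates[:5])
```

Printing the first few iterates makes it easy to compare a run against the x3 column of Table 1.3, keeping in mind that the table was computed with chopped five-place arithmetic.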


Table 1.3 Method of linear interpolation for $f(x) = x^3 + x^2 - 3x - 3 = 0$

Iteration
number     x1        x2     x3         f(x1)       f(x2)   f(x3)
1          1.0       2.0    1.57142    -4.0        3.0     -1.36449
2          1.57142   2.0    1.70540    -1.36449    3.0     -0.24784
3          1.70540   2.0    1.72788    -0.24784    3.0     -0.03936
4          1.72788   2.0    1.73140    -0.03936    3.0     -0.00615
5          1.73140   2.0    1.73194*

*Error in x3 after five iterations is 0.00011.


Table 1.3 discloses a serious fault of the interpolation method: The approach to the 
root is one-sided. If f(x) has significant curvature between x1 and x2, this can be most
damaging to the speed with which we approach the root, as shown in Fig. 1.6. 

A remedy for this is the modified linear interpolation method. We replace the value 
of f(x) at the stagnant end position with f(x)/2.* This helps, as Fig. 1.7 shows. 

An algorithm for this modification to the method of linear interpolation is shown 
below. 


Modified Linear Interpolation Method 


To determine a root of f(x) = 0, given values of x1 and x2 such that f(x1) and
f(x2) are of opposite sign,


Set SAVE = f(x1); set F1 = f(x1); set F2 = f(x2).


REPEAT
    Set x3 = x2 - F2 * (x2 - x1) / (F2 - F1).
    IF f(x3) of opposite sign to F1:
        Set x2 = x3.
        Set F2 = f(x3).
        IF f(x3) of same sign as SAVE:
            Set F1 = F1/2.
        ENDIF.
    ELSE Set x1 = x3.
        Set F1 = f(x3).
        IF f(x3) of same sign as SAVE:
            Set F2 = F2/2.
        ENDIF.
    ENDIF.
    Set SAVE = f(x3).
UNTIL |f(x3)| < tolerance value.
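The modified algorithm can be sketched in Python as follows. The variable names F1, F2, and SAVE of the box are kept (in lowercase for `save`); the function name `modified_regula_falsi`, the iteration limit, and the returned list of iterates are assumptions of this sketch.

```python
import math

def modified_regula_falsi(f, x1, x2, tol=1e-6, max_iter=100):
    """Linear interpolation with halving of the stagnant ordinate."""
    F1, F2 = f(x1), f(x2)
    if F1 * F2 >= 0.0:
        raise ValueError("f(x1) and f(x2) must have opposite signs")
    save = F1
    iterates = []
    x3 = x2
    for _ in range(max_iter):
        x3 = x2 - F2 * (x2 - x1) / (F2 - F1)
        f3 = f(x3)
        iterates.append(x3)
        if abs(f3) < tol:
            break
        if f3 * F1 < 0.0:
            x2, F2 = x3, f3
            if f3 * save > 0.0:    # same side twice in a row: halve the other end
                F1 /= 2.0
        else:
            x1, F1 = x3, f3
            if f3 * save > 0.0:
                F2 /= 2.0
        save = f3
    return x3, iterates

def cubic(x):
    return x**3 + x**2 - 3.0 * x - 3.0

root, iterates = modified_regula_falsi(cubic, 1.0, 2.0)
print(root)
print(iterates[:5])
```

Halving keeps the sign of the stored ordinate, so F1 and F2 always remain of opposite sign and the denominator F2 - F1 cannot vanish.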


*This halving of the ordinate at the other end of the interval is omitted when we step “beyond” the root, as 
occurs on the third iteration in Fig. 1.7. 
Figure 1.6


Figure 1.7 


Table 1.4 shows the modified method applied to our example polynomial problem.
While the modified method normally converges faster, in this example there is no difference.
The changing values of x1 and x2 do better confine the root, however.


Table 1.4 Modified linear interpolation for $f(x) = x^3 + x^2 - 3x - 3 = 0$

Iteration
number     x1        x2        x3          F1               F2         f(x3)            SAVE
1          1.0       2.0       1.57142     -4.0             3.0        -1.36449         -4.0
2          1.57142   2.0       1.77557     -1.36449         1.5*       0.42369          -1.36449
3          1.57142   1.77557   1.72720     -1.36449         0.42369    -0.04576         0.42369
4          1.72720   1.77557   1.73191     -0.04576         0.42369    -1.332 x 10^-3   -0.04576
5          1.73191   1.77557   1.732183†   -1.332 x 10^-3   0.21184*

*These function values are old F2/2.
†Error in x3 after five iterations is -0.00013.


Figure 1.8


Table 1.5 Secant method for $f(x) = x^3 + x^2 - 3x - 3 = 0$

Iteration
number     x1        x2        x3         f(x1)       f(x2)       f(x3)
1          1.0       2.0       1.57142    -4.0        3.0         -1.36449
2          2.0       1.57142   1.70540    3.0         -1.36449    -0.24784
3          1.57142   1.70540   1.73513    -1.36449    -0.24784    0.02920
4          1.70540   1.73513   1.73199    -0.24784    0.02920     -0.0005755
5          1.73513   1.73199   1.73205*

*Error in x3 after five iterations is < 10^-6.


There is one other way that we can improve the method of linear interpolation. 
Instead of requiring that the function have opposite signs at the two values used for 
interpolation, we can choose the two values nearest the root (as indicated by the magnitude 
of the function at the various points) and interpolate or extrapolate from these. Usually
the nearest values to the root will be the last two values calculated. This makes the 
interval under consideration shorter and hence improves the assumption that the function 
can be represented by the line through the two points. 

Table 1.5 shows the calculations according to this scheme, which is known as the 
secant method.* The example illustrates a more rapid convergence: after five iterations the
secant estimate is more accurate than the estimate obtained by linear interpolation.

We leave the development of the algorithm for the secant method as an exercise for 
the student. 

It is important not to extrapolate to a root from two points whose functional values
are of the same sign when knowledge is lacking that a real root is nearby. Figure 1.8
illustrates the futility of searching for a root that is not there. This is especially important
in a computer program, since the successive calculated values are usually not apparent
as soon as they are computed, as they are in a hand computation. In addition, it will be
observed that the secant method can lead to a division by zero when f(x1) = f(x2). It
may also shoot off to find a root different from the expected one.
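Readers who have worked the exercise may wish to compare their algorithm with the following sketch, which is only one of several reasonable formulations. It is written in Python, always interpolates from the last two values calculated, and guards explicitly against the division by zero mentioned above. The function name `secant`, the stopping test on successive x-values, and the iteration limit are assumptions of this sketch.

```python
import math

def secant(f, x1, x2, tol=1e-10, max_iter=50):
    """Secant iteration from the two most recent points."""
    f1, f2 = f(x1), f(x2)
    for _ in range(max_iter):
        if f2 == 0.0:              # x2 is already a root
            return x2
        if f2 == f1:               # horizontal secant line: no x-axis intersection
            raise ZeroDivisionError("f(x1) == f(x2): cannot extrapolate")
        x3 = x2 - f2 * (x2 - x1) / (f2 - f1)
        if abs(x3 - x2) < tol:
            return x3
        x1, f1 = x2, f2            # discard the oldest point
        x2, f2 = x3, f(x3)
    raise RuntimeError("secant iteration did not converge")

def cubic(x):
    return x**3 + x**2 - 3.0 * x - 3.0

print(secant(cubic, 1.0, 2.0))
```

The iteration limit matters here more than in the bracketing methods, because nothing forces the iterates to stay near the root that was expected.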




*So called because the line through two points on a curve is the secant line. 


The methods based on linear interpolation are not limited to polynomials, of course.
Table 1.6 compares the methods so far discussed when each is used to find the root of
the equation

$$3x + \sin x - e^x = 0.$$

The trigonometric term is, of course, evaluated with the x-value in radians. Each method
began with x1 = 0 and x2 = 1.

Note that all the methods we have been using require an initial estimate of the root
we are computing. It often requires as much thought and effort to get a good starting
value as it does to refine it to acceptable accuracy. Sometimes one's knowledge of the
physical problem will suggest a starting value. When this is not available, one normally
finds starting values by initial trial-and-error computations, or by making a rough graph
of the function. We later discuss some methods that are self-starting for polynomials.


NEWTON'S METHOD 


One of the most widely used methods of solving equations is Newton's method.* Like
the previous ones, this method is also based on a linear approximation of the function,
but does so using a tangent to the curve. Figure 1.9 gives a graphical description. Starting
from an initial estimate that is not too far from a root, we extrapolate along the tangent
to its intersection with the x-axis, and take that as the next approximation. This is continued
until either the successive x-values are sufficiently close, or the value of the function is
sufficiently near zero.†

The calculation scheme follows immediately from the right triangle shown in Fig.
1.9, which has the angle of inclination of the tangent line to the curve at $x = x_1$ as one
of its acute angles:

$$\tan\theta = f'(x_1) = \frac{f(x_1)}{x_1 - x_2}, \qquad x_2 = x_1 - \frac{f(x_1)}{f'(x_1)},$$

or, in more general terms,

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}, \qquad n = 1, 2, 3, \ldots.$$


*Newton did not publish an extensive discussion of this method, but he solved a cubic polynomial in Principia
(1687). The version given here is considerably improved over his original example.
†Which criterion should be used often depends on the particular physical problem to which the equation applies.
Customarily, agreement of successive x-values to a specified tolerance is required.


Figure 1.9


Newton's algorithm is widely used because, at least in the near neighborhood of a 
root, it is more rapidly convergent than any of the methods so far discussed. We show 
in a later section that the method is quadratically convergent, by which we mean that the 
error of each step approaches a constant K times the square of the error of the previous 
step. The net result of this is that the number of decimal places of accuracy nearly doubles 
at each iteration. However, offsetting this is the need for two function evaluations at each
step, $f(x_n)$ and $f'(x_n)$.

When Newton's method is applied to $f(x) = 3x + \sin x - e^x = 0$, we have the
following calculations:

$$f(x) = 3x + \sin x - e^x, \qquad f'(x) = 3 + \cos x - e^x.$$

If we begin with $x_1 = 0.0$, we have

$$x_2 = x_1 - \frac{f(x_1)}{f'(x_1)} = 0.0 - \frac{-1.0}{3.0} = 0.33333;$$

$$x_3 = x_2 - \frac{f(x_2)}{f'(x_2)} = 0.33333 - \frac{-0.068418}{2.54934} = 0.36017;$$

$$x_4 = x_3 - \frac{f(x_3)}{f'(x_3)} = 0.36017 - \frac{-6.279 \times 10^{-4}}{2.50226} = 0.3604217.$$


After three iterations, the root is correct to seven significant digits. Comparing this
with the results in Table 1.6, we see that Newton's method converges considerably more
rapidly than the previous methods. In comparing numerical methods, however, one usually
counts the number of times functions must be evaluated. Because Newton's method
requires two function evaluations per step, the comparison is not as one-sided in favor
of Newton's method as at first appears; the three iterations with Newton's method
required six function evaluations. Five iterations with the previous methods required
seven evaluations.

A more formal statement of the algorithm for Newton's method, suitable for implementation
in a computer program, is shown here.


Figure 1.10


Newton's Method


To determine a root of $f(x) = 0$, given a value $x_1$ reasonably close to the root,


Compute $f(x_1)$, $f'(x_1)$.
Set $x_2 = x_1$.
IF ($f(x_1) \ne 0$) AND ($f'(x_1) \ne 0$)
    REPEAT
        Set $x_1 = x_2$.
        Set $x_2 = x_1 - f(x_1)/f'(x_1)$.
    UNTIL ($|x_1 - x_2| <$ tolerance value 1) OR
          ($|f(x_2)| <$ tolerance value 2).


Note: The method may converge to a root different from the expected one or
diverge if the starting value is not close enough to the root.
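The boxed algorithm translates almost line for line into Python. The following is a minimal sketch under stated assumptions: the names `newton`, `xtol`, `ftol`, and `max_iter` are introduced here for illustration (they play the roles of "tolerance value 1" and "tolerance value 2"), and the iteration cap is an added safeguard that the box does not mention.

```python
import math


def newton(f, fprime, x, xtol=1e-7, ftol=1e-12, max_iter=50):
    """Newton iteration x <- x - f(x)/f'(x), stopping on either tolerance.

    xtol, ftol and max_iter are illustrative choices, not values from the text.
    """
    for _ in range(max_iter):
        fx, dfx = f(x), fprime(x)
        if fx == 0.0:
            return x                      # already exactly at a root
        if dfx == 0.0:
            raise ZeroDivisionError("f'(x) = 0; the tangent never meets the axis")
        x_new = x - fx / dfx
        if abs(x_new - x) < xtol or abs(f(x_new)) < ftol:
            return x_new
        x = x_new
    raise RuntimeError("Newton's method did not converge")


# The example from the text: f(x) = 3x + sin x - e^x, starting at x = 0.
f = lambda x: 3 * x + math.sin(x) - math.exp(x)
fprime = lambda x: 3 + math.cos(x) - math.exp(x)
root = newton(f, fprime, 0.0)
print(root)
```

Starting from 0.0, the first iterates are 0.33333 and 0.36017, matching the hand calculation above.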


Newton's method can be applied to polynomial functions, of course, and special 
techniques facilitate such application. We consider these in a later section of this chapter. 

In some cases Newton's method will not converge. Figure 1.10 illustrates this
situation. Starting with $x_1$, one never reaches the root $r$. We will develop the analytical
condition for this in a later section and show that Newton's method is quadratically
convergent in most cases.
MULLER'S METHOD


Each of the root-finding methods that we have considered so far has approximated the 
function in the neighborhood of the root by a straight line. Obviously this is never true; 
if the function were linear, finding the root would take practically no effort. Muller’s 
method is based on approximating the function in the neighborhood of the root by a 
quadratic polynomial. This gives a much closer match to the actual curve. 

A second-degree polynomial is made to fit three points near a root, $[x_0, f(x_0)]$,
$[x_1, f(x_1)]$, $[x_2, f(x_2)]$, and the proper zero of this quadratic, using the quadratic formula,
is used as the improved estimate of the root. The process is then repeated using the set
of three points nearest the root being evaluated.

The procedure for Muller's method is developed by writing a quadratic equation that
fits through three points in the vicinity of a root, in the form $av^2 + bv + c$. (See Fig.
1.11.) The development is simplified if we transform axes to pass through the middle
point, by letting $v = x - x_0$.

Let $h_1 = x_1 - x_0$ and $h_2 = x_0 - x_2$. We evaluate the coefficients by evaluating
$p_2(v)$ at the three points:

$$v = 0: \qquad a(0)^2 + b(0) + c = f_0;$$
$$v = h_1: \qquad ah_1^2 + bh_1 + c = f_1;$$
$$v = -h_2: \qquad ah_2^2 - bh_2 + c = f_2.$$

From the first equation, $c = f_0$. Letting $h_2/h_1 = \gamma$, we can solve the other two
equations for $a$ and $b$:

$$a = \frac{\gamma f_1 - f_0(1 + \gamma) + f_2}{\gamma h_1^2(1 + \gamma)}, \qquad b = \frac{f_1 - f_0 - ah_1^2}{h_1}.$$

After computing $a$, $b$, and $c$, we solve for the root of $av^2 + bv + c = 0$ by the
quadratic formula, choosing the root nearest to the middle point $x_0$. This value is

$$\text{root} = x_0 - \frac{2c}{b \pm \sqrt{b^2 - 4ac}},$$

with the sign in the denominator taken to give the largest absolute value of the denominator
(that is, if $b > 0$, choose plus; if $b < 0$, choose minus; if $b = 0$, choose either).

We take the root of the polynomial as one of a set of three points for the next
approximation, taking the three points that are most closely spaced (that is, if the root
is to the right of $x_0$, take $x_0$, $x_1$, and the root; if to the left, take $x_0$, $x_2$, and the root).
We always reset the subscripts to make $x_0$ be the middle of the three values.


Figure 1.11


An algorithm for Muller's method is


Muller's Method
Given the points $x_2$, $x_0$, $x_1$ in increasing value,
1. Evaluate the corresponding function values $f_2$, $f_0$, $f_1$.
2. Find the coefficients of the parabola determined by the three points.
3. Compute the two roots of the parabolic equation.
4. Choose the root closest to $x_0$ and label it $x_r$.
5. IF $x_r > x_0$ THEN rearrange $x_0$, $x_r$, $x_1$ into $x_2$, $x_0$, $x_1$
   ELSE rearrange $x_2$, $x_r$, $x_0$ into $x_2$, $x_0$, $x_1$.
6. IF $|f(x_r)| <$ FTOL, THEN RETURN ($x_r$)
   ELSE go to 1.


EXAMPLE Find a root of $f(x) = \sin x - x/2$ near $x = 2.0$.


Let

$$x_0 = 2.0, \quad f(x_0) = -0.09070, \quad h_1 = 0.2;$$
$$x_1 = 2.2, \quad f(x_1) = -0.29150, \quad h_2 = 0.2;$$
$$x_2 = 1.8, \quad f(x_2) = 0.07385, \quad \gamma = 1.0.$$

Then

$$a = \frac{(1.0)(-0.29150) - (-0.09070)(2.0) + 0.07385}{(1.0)(0.2)^2(2.0)} = -0.45312,$$

$$b = \frac{-0.29150 - (-0.09070) - (-0.45312)(0.2)^2}{0.2} = -0.91338,$$

$$c = -0.09070;$$

and

$$\text{Root} = 2.0 - \frac{2(-0.09070)}{-0.91338 - \sqrt{(0.91338)^2 - 4(-0.45312)(-0.09070)}} = 1.89526.$$

For the next iteration, we have

$$x_0 = 1.89526, \quad f(x_0) = 1.9184 \times 10^{-4}, \quad h_1 = 0.10474;$$
$$x_1 = 2.0, \quad f(x_1) = -0.09070, \quad h_2 = 0.09526;$$
$$x_2 = 1.8, \quad f(x_2) = 0.07385, \quad \gamma = 0.9095.$$

Then

$$a = \frac{(0.9095)(-0.09070) - (1.9184 \times 10^{-4})(1.9095) + 0.07385}{(0.9095)(0.10474)^2(1.9095)} = -0.47280,$$

$$b = \frac{-0.09070 - 1.9184 \times 10^{-4} - (-0.47280)(0.10474)^2}{0.10474} = -0.81826,$$

$$c = 1.9184 \times 10^{-4};$$

and

$$\text{Root} = 1.89526 - \frac{(2)(1.9184 \times 10^{-4})}{-0.81826 - \sqrt{(0.81826)^2 - 4(-0.47280)(1.9184 \times 10^{-4})}} = 1.895494.$$


After this second iteration, the root is accurate to seven significant digits. Experience 
shows that Muller's method converges at a rate that is similar to that for Newton's 
method.* It does not require the evaluation of derivatives, however, and (after we have 
obtained the starting values) needs only one function evaluation per iteration. There is 
an initial penalty in that one must evaluate the function three times, but this is frequently 
overcome by the time the required precision is attained. 
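A compact Python sketch of the procedure follows, using the same $h_1$, $h_2$, $\gamma$ formulas and the same rule for picking the sign in the denominator. The names `muller`, `ftol`, and `max_iter` are assumptions made for this illustration. Raising an error when the parabola has no real zero is a simplification introduced here, since this sketch stays in real arithmetic.

```python
import math


def muller(f, x2, x0, x1, ftol=1e-9, max_iter=50):
    """Muller's method with x2 < x0 < x1 as the starting points.

    Real arithmetic only; ftol and max_iter are illustrative choices.
    """
    f2, f0, f1 = f(x2), f(x0), f(x1)
    for _ in range(max_iter):
        h1, h2 = x1 - x0, x0 - x2
        g = h2 / h1                                   # gamma in the text
        a = (g * f1 - f0 * (1 + g) + f2) / (g * h1 * h1 * (1 + g))
        b = (f1 - f0 - a * h1 * h1) / h1
        c = f0
        disc = b * b - 4 * a * c
        if disc < 0:
            raise ValueError("parabola has no real zero (complex root suspected)")
        # Sign chosen to give the denominator the largest absolute value.
        denom = b + math.copysign(math.sqrt(disc), b)
        xr = x0 - 2 * c / denom
        fr = f(xr)                                    # one new evaluation per pass
        if abs(fr) < ftol:
            return xr
        if xr > x0:
            x2, f2, x0, f0 = x0, f0, xr, fr           # keep x0, root, x1
        else:
            x1, f1, x0, f0 = x0, f0, xr, fr           # keep x2, root, x0
    raise RuntimeError("Muller's method did not converge")


root = muller(lambda x: math.sin(x) - x / 2, 1.8, 2.0, 2.2)
print(root)
```

The function values are carried along with the points, so each pass costs a single new evaluation of `f`, in line with the remark above about one function evaluation per iteration.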


USE OF x = g(x) METHOD 


We now discuss another method that is of general applicability and that also lets us 
develop some necessary theory. We begin with the equation f(x) = 0, and rearrange it 
into an equivalent expression of the form 


x = g(x), such that if f(r) = 0, r = g(r). 


Under suitable conditions, which we will develop, the algorithm

$$x_{n+1} = g(x_n), \qquad n = 1, 2, 3, \ldots,$$

will converge to a zero of $f(x)$. Consider a simple example:

$$f(x) = x^2 - 2x - 3 = 0,$$

which has obvious roots at $x = 3$, $x = -1$. Rearranging yields

$$x = \sqrt{2x + 3},$$

so $g(x) = \sqrt{2x + 3}$. Starting with $x_1 = 4$, we get

$$x_2 = \sqrt{11} = 3.316,$$
$$x_3 = \sqrt{9.632} = 3.104,$$
$$x_4 = \sqrt{9.208} = 3.034,$$
$$x_5 = \sqrt{9.068} = 3.011,$$
$$x_6 = \sqrt{9.022} = 3.004.$$

The various iterates appear to converge to $x = 3$.


*Atkinson (1978) shows that each error is about proportional to the previous error to the 1.85th power.
†Some authors simply call the method the iteration method.

The equation $f(x) = x^2 - 2x - 3 = 0$ can be rearranged in other ways also. For
example, $x = 3/(x - 2)$ is an alternative rearrangement of form $x = g(x)$. If $x_1 = 4$,

$$x_2 = 1.5,$$
$$x_3 = -6,$$
$$x_4 = -0.375,$$
$$x_5 = -1.263,$$
$$x_6 = -0.919,$$
$$x_7 = -1.028,$$
$$x_8 = -0.991,$$
$$x_9 = -1.003.$$

Note that this converges, but to the root at $x = -1$, and that the iterates oscillate rather
than converging monotonically.
Consider a third rearrangement:

$$x = \frac{x^2 - 3}{2}.$$

For $x_1 = 4$ we get

$$x_2 = 6.5,$$
$$x_3 = 19.625,$$
$$x_4 = 191.0,$$

which obviously is diverging.
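The three rearrangements are easy to compare side by side. The short Python sketch below simply applies each $g$ repeatedly from the starting value 4; the helper name `iterate` and the variable names are assumptions made for this illustration.

```python
import math


def iterate(g, x, n):
    """Return [x_1, x_2, ..., x_{n+1}] where x_{k+1} = g(x_k)."""
    xs = [x]
    for _ in range(n):
        x = g(x)
        xs.append(x)
    return xs


g_a = lambda x: math.sqrt(2 * x + 3)   # converges monotonically to 3
g_b = lambda x: 3 / (x - 2)            # oscillates toward -1
g_c = lambda x: (x * x - 3) / 2        # diverges from x_1 = 4

a = iterate(g_a, 4.0, 5)
b = iterate(g_b, 4.0, 8)
c = iterate(g_c, 4.0, 3)
for name, xs in (("sqrt(2x+3)", a), ("3/(x-2)", b), ("(x^2-3)/2", c)):
    print(name, [round(v, 3) for v in xs])
```

All three forms are algebraically equivalent to $x^2 - 2x - 3 = 0$, so the difference in behavior comes entirely from the choice of $g$.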
Figure 1.12 illustrates the various cases; (a) shows monotonic convergence, (b) shows
oscillatory convergence, and (c) shows divergence. For a function $x = g(x)$, the solution
is at the intersection of the line $y_1 = x$ with the curve $y_2 = g(x)$. In every case, we move
vertically to the curve and then horizontally to the line, and repeat. The point $r$ where
$g(r) = r$ is often called a fixed point of the function $g$. An algorithm for this method is
shown below.


Figure 1.12  (a) Monotonic convergence; (b) oscillatory convergence; (c) divergence.


Iteration with the Form x = g(x)
To determine a root of $f(x) = 0$, given a value $x_1$ reasonably close to the root,


Rearrange the equation to an equivalent form $x = g(x)$.
Set $x_2 = x_1$.
REPEAT
    Set $x_1 = x_2$.
    Set $x_2 = g(x_1)$.
UNTIL $|x_1 - x_2| <$ tolerance value.


Note: The method may converge to a root different from the expected one, or
it may diverge. Different rearrangements will converge at different rates.


We now study the conditions that are needed for convergence. We iterate with

$$x_{n+1} = g(x_n).$$

Let $x = r$ be a solution to $f(x) = 0$, so $f(r) = 0$ and $r = g(r)$. Subtracting, and then
multiplying and dividing by $(x_n - r)$, we have

$$x_{n+1} - r = g(x_n) - g(r) = \frac{g(x_n) - g(r)}{x_n - r}(x_n - r).$$

If $g(x)$ and $g'(x)$ are continuous on the interval from $r$ to $x_n$, the mean-value theorem*
lets us write

$$x_{n+1} - r = g'(\xi_n)(x_n - r),$$

where $\xi_n$ lies between $x_n$ and $r$.
If we define the error of the ith iterate as $e_i = x_i - r$, and $e_{i+1} = x_{i+1} - r$, we can
write

$$e_{i+1} = g'(\xi_i)\,e_i.$$

Taking absolute values, we get

$$|e_{i+1}| = |g'(\xi_i)| \cdot |e_i|.$$

Now suppose that $|g'(x)| \le K < 1$ for all values of $x$ in an interval of radius $h$ about
$r$. If $x_1$ is chosen in this interval, $x_2$ will also be in the interval and the algorithm will
converge, since

$$|e_{n+1}| \le K|e_n| \le K^2|e_{n-1}| \le \cdots \le K^n|e_1|.$$


In summary, if $g(x)$ and $g'(x)$ are continuous on an interval about a root $r$ of the
equation $x = g(x)$, and if $|g'(x)| < 1$ for all $x$ in the interval, then $x_{n+1} = g(x_n)$,
$n = 1, 2, 3, \ldots$, will converge to the root $x = r$, provided that $x_1$ is chosen
in the interval. Note that this is a sufficient condition only, since for some
equations convergence is secured even though not all the conditions hold.†


*Appendix A reviews certain calculus principles, including this theorem.

†The analytical test that $|g'(x)| < 1$ is often awkward to apply. A constructive test is merely to observe whether
the successive $x_i$ values converge. In a computer program it is worthwhile to determine whether
$|x_3 - x_2| < |x_2 - x_1|$.
Since the preceding demonstration shows that $e_{i+1} = g'(\xi_i) \cdot e_i$, it is obvious that
the rate of convergence is rapid if $|g'(x)|$ is small in the interval. If the derivative is
negative, the errors alternate in sign, giving oscillatory convergence. The curves of Figure
1.12 give visual confirmation of this. As the iterates get closer to the root $r$, the values
of $g'(\xi)$ approach the constant value $g'(r)$ because the $\xi$'s are squeezed into smaller
intervals about $r$. In the limit, each error becomes proportional to the previous error. For
this reason, the method is sometimes called linear iteration.

Even though the proportionality between successive errors is true only in the limiting
situation, if we assume that each of the errors is proportional to the previous one, we
can develop an acceleration technique called Aitken acceleration that is often useful.

Assume that

$$e_n = x_n - r = K^{n-1}e_1,$$

or

$$x_n = r + K^{n-1}e_1.$$

Similarly, if

$$e_{n+1} = x_{n+1} - r = K^n e_1,$$

and

$$e_{n+2} = x_{n+2} - r = K^{n+1}e_1,$$

then

$$x_{n+1} = r + K^n e_1,$$

and

$$x_{n+2} = r + K^{n+1}e_1.$$

Substitute these expressions into

$$\frac{x_n x_{n+2} - x_{n+1}^2}{x_{n+2} - 2x_{n+1} + x_n}.$$

It is found that

$$\frac{x_n x_{n+2} - x_{n+1}^2}{x_{n+2} - 2x_{n+1} + x_n}
= \frac{(r + K^{n-1}e_1)(r + K^{n+1}e_1) - (r + K^n e_1)^2}{(r + K^{n+1}e_1) - 2(r + K^n e_1) + (r + K^{n-1}e_1)}
= \frac{r(K^{n+1} - 2K^n + K^{n-1})e_1}{(K^{n+1} - 2K^n + K^{n-1})e_1} = r.$$

From three successive estimates of the root, $x_1$, $x_2$, and $x_3$, we extrapolate to an
improved estimate. Since the assumption of constant ratio between successive errors is
not normally true, our extrapolated value is not exact, but it is usually improved. One
uses this technique by calculating two new values, extrapolating again, and so on.

A different form is useful to avoid the round-off problem that occurs in subtracting
large numbers of nearly the same magnitude. Define

$$\Delta x_i = x_{i+1} - x_i,$$

$$\Delta^2 x_i = \Delta(\Delta x_i) = \Delta(x_{i+1} - x_i) = x_{i+2} - 2x_{i+1} + x_i.$$

Our acceleration scheme becomes

$$r = x_n - \frac{(\Delta x_n)^2}{\Delta^2 x_n} = \frac{x_n x_{n+2} - x_{n+1}^2}{x_{n+2} - 2x_{n+1} + x_n}.$$


The differences are most readily computed in a table. We illustrate with the iterates
from the first example in this section:

$$f(x) = x^2 - 2x - 3 = 0, \qquad x_{n+1} = \sqrt{2x_n + 3}, \qquad x_1 = 4.$$

    x                Δx         Δ²x
    x_1 = 4.000     -0.684      0.472
    x_2 = 3.316     -0.212
    x_3 = 3.104

The accelerated estimate is

$$r = 4.000 - \frac{(-0.684)^2}{0.472} = 3.009.$$

We have jumped ahead about two iterations. If $g(x)$ is expensive to compute, we have
gained. Often Aitken acceleration gives bigger jumps than this. In fact, there is an excellent
measure to determine when to use Aitken acceleration.

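Suppose for some $n$ we have $x_n$, $x_{n+1}$, $x_{n+2}$, $x_{n+3}$. Then evaluate

$$C = \frac{\sum x_i x_{i+1} - \frac{1}{3}\sum x_i \sum x_{i+1}}
{\sqrt{\left[\sum x_i^2 - \frac{1}{3}\left(\sum x_i\right)^2\right]\left[\sum x_{i+1}^2 - \frac{1}{3}\left(\sum x_{i+1}\right)^2\right]}},$$

where the sums are from $i = n$ to $i = n + 2$. If $C$ is close to +1, then Aitken acceleration
is most effective. In the present example we have the values

$$x_0 = 4.000, \quad x_1 = 3.316, \quad x_2 = 3.104, \quad x_3 = 3.034.$$

We find that with $n = 0$ in the formula,

$$C = \frac{32.974 - \frac{1}{3}(10.42)(9.454)}{\sqrt{(36.631 - 36.1921)(29.8358 - 29.7927)}} = 0.99992.$$
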

See Jones (1982) for this and further extensions for improving the acceleration method. 
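Both the extrapolation and the measure $C$ are short computations. The sketch below is illustrative only: the function names `aitken` and `aitken_measure` are assumptions introduced here, and the measure is coded directly from the sums in the worked example (three pairs of successive iterates, hence the division by 3).

```python
import math


def aitken(x0, x1, x2):
    """Aitken extrapolation r = x0 - (dx)^2 / (d2x) from three successive iterates."""
    dx = x1 - x0
    d2x = x2 - 2 * x1 + x0
    return x0 - dx * dx / d2x


def aitken_measure(xs):
    """Measure C from four successive iterates, using the sums shown in the text."""
    u, v = xs[:3], xs[1:4]                    # x_i and x_{i+1}, i = n .. n+2
    suv = sum(p * q for p, q in zip(u, v))
    su, sv = sum(u), sum(v)
    suu = sum(p * p for p in u)
    svv = sum(q * q for q in v)
    return (suv - su * sv / 3) / math.sqrt((suu - su * su / 3) * (svv - sv * sv / 3))


iterates = [4.000, 3.316, 3.104, 3.034]       # rounded values from the example
r = aitken(*iterates[:3])
C = aitken_measure(iterates)
print(round(r, 3), round(C, 5))
```

Written with the differences, the extrapolation avoids forming the product $x_n x_{n+2}$ and the square $x_{n+1}^2$, which are nearly equal close to convergence.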
1.7 CONVERGENCE OF NEWTON'S METHOD


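We now use the result of the previous section to show a criterion for convergence of
Newton's method. The algorithm

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} = g(x_n)$$

is of the form $x_{n+1} = g(x_n)$, which converges if $|g'(x)| < 1$ on an interval about
the root. Differentiating,

$$g'(x) = 1 - \frac{f'(x)f'(x) - f(x)f''(x)}{[f'(x)]^2} = \frac{f(x)f''(x)}{[f'(x)]^2}.$$

Hence if

$$\left|\frac{f(x)f''(x)}{[f'(x)]^2}\right| < 1$$

on an interval about the root $r$, the method will converge for any initial value $x_1$ in the
interval. The condition is sufficient only, and requires the usual continuity and existence
of $f(x)$ and its derivatives. Note that $f'(x)$ must not be zero.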

Now we show that Newton's method is quadratically convergent. Since $r$ is a root
of $f(x) = 0$, $r = g(r)$. Since $x_{n+1} = g(x_n)$, we can write

$$x_{n+1} - r = g(x_n) - g(r).$$

Let us expand $g(x_n)$ as a Taylor series* in terms of $(x_n - r)$, with the second-derivative
term as the remainder:

$$g(x_n) = g(r) + g'(r)(x_n - r) + \frac{g''(\xi)}{2}(x_n - r)^2,$$

where $\xi$ lies in the interval from $x_n$ to $r$.

Since

$$g'(r) = \frac{f(r)f''(r)}{[f'(r)]^2} = 0$$

because $f(r) = 0$ ($r$ is a root), we have

$$g(x_n) = g(r) + \frac{g''(\xi)}{2}(x_n - r)^2.$$

Letting $x_n - r = e_n$, we have

$$e_{n+1} = x_{n+1} - r = g(x_n) - g(r) = \frac{g''(\xi)}{2}e_n^2.$$

Each error is (in the limit) proportional to the square of the previous error; that is, Newton's
method is quadratically convergent.*


*See Appendix A for a review of Taylor series.


1.8 METHODS FOR POLYNOMIALS


Polynomial functions are of special importance. We will see throughout the remainder
of this book that many valuable numerical procedures are based on polynomials. This
important role of polynomial functions is due to their "nice" behavior: They are everywhere
continuous, they are smooth, their derivatives are also continuous and smooth, and they
are readily evaluated. Descartes' rule of signs (see Appendix A) lets us predict the number
of positive roots. Polynomials are particularly well adapted to computers because the
only mathematical operations they require for evaluation are addition, subtraction, and
multiplication, all of which are speedy operations on computers.

Because of this special importance of polynomials, we now consider how our
root-finding methods can be applied to them. For most of the methods previously discussed
there is nothing new to say, but for Newton's method there are significant new ideas to
consider. We begin on a historical note, with a procedure that saves time in hand
computations. However, we will see that this same procedure is also the basis for computer
calculations.

In applying Newton's method to polynomials, it is most efficient to evaluate $f(x_n)$
and $f'(x_n)$ by use of synthetic division.† We illustrate this by the same cubic polynomial
that we used before, $x^3 + x^2 - 3x - 3 = 0$, which has a root at $x = \sqrt{3}$. We begin
with the value $x = 2$. We utilize the remainder theorem to evaluate $f(2)$, and evaluate
$f'(2)$ as the remainder when the reduced polynomial (of degree 2 here) is divided by
$(x - 2)$:


    x_1 = 2 |  1     1    -3    -3
            |        2     6     6
               1     3     3     3   <-- remainder = f(2)
                     2    10
               1     5    13         <-- second remainder = f'(2)


*If $f'(x) = 0$ at $x = r$ (hence a multiple root), the rate of convergence will not be quadratic. For a multiple
root, it can be shown that convergence is linear.

†The mechanics of synthetic division, whereby we divide a polynomial by the factor $x - x_1$, are explained in
most algebra books. In the example, when $x^3 + x^2 - 3x - 3$ is divided by $(x - 2)$, the result is $x^2 + 3x + 3$,
with a remainder of 3.

$$x_2 = 2 - \frac{3}{13} = 1.76923.$$

    x_2 = 1.76923 |  1     1          -3          -3
                  |        1.76923     4.89940     3.36048
                     1     2.76923     1.89940     0.36048
                           1.76923     8.02957
                     1     4.53846     9.92897

$$x_3 = 1.76923 - \frac{0.36048}{9.92897} = 1.73292.$$

Similarly,

$$x_4 = 1.73292 - \frac{f(x_3)}{f'(x_3)} = 1.73205.$$


The value of $x_4$ is correct to five decimals. To observe the improvement in accuracy,
consider the successive errors:


                      Error      Number of Correct Figures
    x_1 = 2           0.26795    1
    x_2 = 1.76923     0.03718    2
    x_3 = 1.73292     0.00087    4
    x_4 = 1.73205     0.00000    6+


In order to compute with five decimal places, as in this example, we used a desk 
calculator. (If you have access to a calculator with storage for two or more values, you 
will find it especially well adapted to this method.) 

The initial value at which Newton's method is begun can make a considerable
difference. For example, if this problem is started with $x = 1$, the following values result:


    x                  f(x)      f'(x)
    x_1 = 1           -4         2
    x_2 = 3           24        30
    x_3 = 2.2          5.888    15.92
    x_4 = 1.83015


From here on, the convergence is rapid, for we are using iterates just about as near 
the root as in the previous example. 

After a first root is found (as shown by a remainder that is very small), one normally 
proceeds to determine additional roots from the reduced polynomial (whose coefficients 
are in the third row of the synthetic-division tableau). This makes the computations 
somewhat shorter. In the example, the reduced equation is a quadratic, so the quadratic 
formula would be used, but if a higher-degree polynomial were being solved, Newton's 
method employing synthetic division would be employed to improve an initial estimate 
of a second root. The process is then repeated until the reduced equation is of second 
degree. 
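The whole cycle (Newton's method driven by synthetic division, then deflation to the reduced polynomial) can be sketched as follows. This is an illustrative sketch rather than the book's program: the names `horner`, `newton_poly`, `tol`, and `max_iter` are assumptions, and coefficients are listed from the highest power down.

```python
import math


def horner(a, x):
    """Two synthetic divisions by (x - x_1): return P(x), P'(x), reduced coefficients."""
    b = [a[0]]
    for coeff in a[1:]:
        b.append(coeff + b[-1] * x)       # b_i = a_i + b_{i-1} x
    dp = 0.0
    for coeff in b[:-1]:
        dp = coeff + dp * x               # second division gives the derivative
    return b[-1], dp, b[:-1]


def newton_poly(a, x, tol=1e-12, max_iter=100):
    """Newton's method for a polynomial; tol and max_iter are illustrative choices."""
    for _ in range(max_iter):
        p, dp, _ = horner(a, x)
        if dp == 0.0:
            raise ZeroDivisionError("P'(x) = 0; Newton step undefined")
        x_new = x - p / dp
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("Newton's method did not converge")


cubic = [1.0, 1.0, -3.0, -3.0]            # x^3 + x^2 - 3x - 3
root1 = newton_poly(cubic, 2.0)
_, _, reduced = horner(cubic, root1)      # deflate: coefficients of the quadratic
b1, b2, b3 = reduced
disc = math.sqrt(b2 * b2 - 4 * b1 * b3)
others = sorted([(-b2 - disc) / (2 * b1), (-b2 + disc) / (2 * b1)])
print(root1, others)
```

Here the reduced equation is a quadratic, so the remaining two roots come from the quadratic formula, as the text suggests.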
This technique of working with the reduced function can be used even if the function
is not a polynomial. After a root $r$ of $f(x) = 0$ has been found, the new function $F(x) =
f(x)/(x - r)$ will have all the roots of $f(x)$ except the root $r$. This procedure is called
deflating the function. One must remember that a discontinuity has been introduced at
$x = r$, however. We suggest that you explore how deflation works on nonpolynomial
functions by graphing $f(x) = (x - 1)(e^x - \cos x)$, then $g(x) = f(x)/x$, $h(x) =
f(x)/(x - 1)$, and comparing the graphs. You may also wish to compare to $f(x)/x$ and to
functions derived by deflating $f(x)$ with other roots than $x = 0$ and $x = 1$.

It should be observed that using deflated functions can result in unexpected errors. 
If the first root is determined only approximately, the coefficients of the reduced equation 
are themselves not exact and the succeeding roots are subject not only to round-off errors 
and the errors that occur when iterations are terminated too soon, but also to inherited 
errors residing in the nonexact coefficients. Some functions are extremely sensitive in 
that small changes in the value of the coefficients cause large differences in the roots. 
Removing roots in order of increasing magnitude is said to minimize the difficulty, and 
the use of double-precision arithmetic will further help preserve accuracy. 

It is of interest to develop the synthetic-division algorithm and to establish the
remainder theorems. The scheme is also the most efficient way to evaluate polynomials
and their derivatives in a computer program.

Write the nth-degree polynomial as

$$P_n(x) = a_1x^n + a_2x^{n-1} + \cdots + a_nx + a_{n+1}.$$

We wish to divide this by the factor $(x - x_1)$, giving a reduced polynomial $Q_{n-1}(x)$ of
degree $n - 1$, and a remainder, $b_{n+1}$, which is a constant:

$$\frac{P_n(x)}{x - x_1} = Q_{n-1}(x) + \frac{b_{n+1}}{x - x_1}.$$

Rearranging yields

$$P_n(x) = (x - x_1)Q_{n-1}(x) + b_{n+1}.$$

Note that at $x = x_1$,

$$P_n(x_1) = (0)[Q_{n-1}(x_1)] + b_{n+1},$$

which is the remainder theorem: The remainder on division by $(x - x_1)$ is the value of
the polynomial at $x = x_1$, $P_n(x_1)$.

If we differentiate the last equation, we get

$$P_n'(x) = (x - x_1)Q_{n-1}'(x) + (1)Q_{n-1}(x) + 0.$$

Letting $x = x_1$, we have

$$P_n'(x_1) = Q_{n-1}(x_1).$$

We evaluate the Q-polynomial at $x_1$ by a second division whose remainder equals $Q_{n-1}(x_1)$.
This verifies that the second remainder from synthetic division yields the value for the
derivative of the polynomial.
We now develop the synthetic-division algorithm, writing $Q_{n-1}(x)$ in form similar
to $P_n(x)$:

$$P_n(x) = a_1x^n + a_2x^{n-1} + \cdots + a_nx + a_{n+1}$$
$$= (x - x_1)Q_{n-1}(x) + b_{n+1}$$
$$= (x - x_1)(b_1x^{n-1} + b_2x^{n-2} + \cdots + b_{n-1}x + b_n) + b_{n+1}.$$

Multiplying out and equating coefficients of like terms in $x$, we get

$$\begin{array}{ll}
\text{Coef. of } x^n: & a_1 = b_1 \\
\text{Coef. of } x^{n-1}: & a_2 = b_2 - b_1x_1 \\
\text{Coef. of } x^{n-2}: & a_3 = b_3 - b_2x_1 \\
\quad\vdots & \quad\vdots \\
\text{Coef. of } x: & a_n = b_n - b_{n-1}x_1 \\
\text{Constant term:} & a_{n+1} = b_{n+1} - b_nx_1
\end{array}$$

The general form is $b_i = a_i + b_{i-1}x_1$, by which all the b's except $b_1$ may be calculated.
If this is compared to the synthetic divisions above, it is seen to be identical, except that
we now have a vertical array. The horizontal layout is easier for hand computation. For
evaluation of the derivative, a set of c-values is computed from the b's in the same way
in which the b's are computed from the a's.

Synthetic division is also known as the nested multiplication method of evaluating
polynomials. Consider the fifth-degree polynomial, evaluated at $x = x_1$:

$$a_1x_1^5 + a_2x_1^4 + a_3x_1^3 + a_4x_1^2 + a_5x_1 + a_6.$$

We can rewrite this as

$$((((a_1x_1 + a_2)x_1 + a_3)x_1 + a_4)x_1 + a_5)x_1 + a_6.$$

In the original form, 5 + 4 + 3 + 2 + 1 = 15 multiplications are required, plus five
additions. In the nested form, only five multiplications are required, plus five additions;
it is obviously the more efficient method.

Comparing this with the equations $b_2 = a_2 + b_1x_1$ and $b_i = a_i + b_{i-1}x_1$ for synthetic
division, we see that the successive terms are formed in exactly the same way, so that
synthetic division and nested multiplication are two names for the same thing.
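The recurrence $b_i = a_i + b_{i-1}x_1$, applied twice, is all that is needed to reproduce the hand tableau. The sketch below keeps the full b-row and c-row so they can be compared with the layout used earlier; the function name `synthetic_division` is an assumption made for this illustration, and the polynomial is assumed to have degree at least 1.

```python
def synthetic_division(a, x1):
    """Return the b-row and c-row of a double synthetic division by (x - x1).

    a lists the coefficients a_1, ..., a_{n+1} from the highest power down.
    b[-1] is P(x1); c[-1] is P'(x1). Assumes degree >= 1.
    """
    b = [a[0]]
    for coeff in a[1:]:
        b.append(coeff + b[-1] * x1)      # b_i = a_i + b_{i-1} x1
    c = [b[0]]
    for coeff in b[1:-1]:                 # the remainder b_{n+1} is not divided again
        c.append(coeff + c[-1] * x1)      # c_i = b_i + c_{i-1} x1
    return b, c


b, c = synthetic_division([1, 1, -3, -3], 2)
print(b)   # third row of the tableau; last entry is f(2)
print(c)   # fifth row of the tableau; last entry is f'(2)
x2 = 2 - b[-1] / c[-1]                    # one Newton step
print(round(x2, 5))
```

Each row costs one multiplication and one addition per coefficient, which is the nested-multiplication count given above.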
BAIRSTOW'S METHOD FOR QUADRATIC FACTORS


The methods considered so far are difficult to use to find a complex root of a polynomial.
It is true that Newton's and Muller's methods work satisfactorily, provided that one begins
with initial estimates that are complex-valued; however, in a hand computation,
performing the multiplications and divisions of complex numbers is awkward. There is no
problem in a computer program if complex arithmetic capabilities exist, but the execution
is slower.

For polynomials, the complex roots occur in conjugate pairs if the coefficients are
all real-valued. For this case, if we extract the quadratic factors that are the products of
the pairs of complex roots, we can avoid complex arithmetic because such quadratic
factors have real coefficients. We first develop the algorithm for synthetic division by a
trial quadratic, $x^2 - rx - s$, which we hope is near to the desired factor of the polynomial:

$$P_n(x) = a_1x^n + a_2x^{n-1} + \cdots + a_nx + a_{n+1}$$
$$= (x^2 - rx - s)Q_{n-2}(x) + \text{remainder}$$
$$= (x^2 - rx - s)(b_1x^{n-2} + b_2x^{n-3} + \cdots + b_{n-2}x + b_{n-1}) + b_n(x - r) + b_{n+1}.$$

(The remainder is the linear term $b_n(x - r) + b_{n+1}$, written in this form to provide later
simplicity. If $x^2 - rx - s$ is an exact divisor of $P_n(x)$, then $b_n$ and $b_{n+1}$ will both be
zero.) The negative signs in the factor are also for later simplification.

On multiplying out and equating coefficients of like powers of $x$, we get

$$\begin{array}{lcl}
a_1 = b_1 & & b_1 = a_1 \\
a_2 = b_2 - rb_1 & & b_2 = a_2 + rb_1 \\
a_3 = b_3 - rb_2 - sb_1 & & b_3 = a_3 + rb_2 + sb_1 \\
a_4 = b_4 - rb_3 - sb_2 & \text{or} & b_4 = a_4 + rb_3 + sb_2 \\
\quad\vdots & & \quad\vdots \\
a_n = b_n - rb_{n-1} - sb_{n-2} & & b_n = a_n + rb_{n-1} + sb_{n-2} \\
a_{n+1} = b_{n+1} - rb_n - sb_{n-1} & & b_{n+1} = a_{n+1} + rb_n + sb_{n-1}.
\end{array} \tag{1.1}$$


We would like both $b_n$ and $b_{n+1}$ to be zero, for that would show $x^2 - rx - s$ to be
a quadratic factor of the polynomial. This will normally not be so; if we properly change
the values of $r$ and $s$, we can make the remainder zero, or at least make its coefficients
smaller. Obviously $b_n$ and $b_{n+1}$ are both functions of the two parameters $r$ and $s$.
Expanding these as a Taylor series for a function of two variables* in terms of $(r^* - r)$ and
$(s^* - s)$, where $(r^* - r)$ and $(s^* - s)$ are presumed small so that terms of higher order
than the first are negligible, we obtain

$$b_n(r^*, s^*) = b_n(r, s) + \frac{\partial b_n}{\partial r}(r^* - r) + \frac{\partial b_n}{\partial s}(s^* - s) + \cdots,$$

$$b_{n+1}(r^*, s^*) = b_{n+1}(r, s) + \frac{\partial b_{n+1}}{\partial r}(r^* - r) + \frac{\partial b_{n+1}}{\partial s}(s^* - s) + \cdots.$$

Let us take $(r^*, s^*)$ as the point at which the remainder is zero, and

$$r^* - r = \Delta r, \qquad s^* - s = \Delta s.$$

($\Delta r$ and $\Delta s$ are increments to add to the original $r$ and $s$ to get the new values $r^*$ and
$s^*$ for which the remainder is zero.) Then

$$b_n(r^*, s^*) = 0 = b_n + \frac{\partial b_n}{\partial r}\Delta r + \frac{\partial b_n}{\partial s}\Delta s,$$

$$b_{n+1}(r^*, s^*) = 0 = b_{n+1} + \frac{\partial b_{n+1}}{\partial r}\Delta r + \frac{\partial b_{n+1}}{\partial s}\Delta s.$$

All the terms on the right are to be evaluated at $(r, s)$. We wish to solve these two
equations simultaneously for the unknown $\Delta r$ and $\Delta s$, so we need to evaluate the partial
derivatives.


*Appendix A reviews this.

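Bairstow showed that the required partial derivatives can be obtained from the b's
by a second synthetic division by the factor $x^2 - rx - s$ in just the same way that the
b's are obtained from the a's. Define a set of c's by the relations shown below at the
left, and compare these to the partial derivatives in the right columns:

$$\begin{array}{lll}
c_1 = b_1 & \dfrac{\partial b_1}{\partial r} = 0 & \dfrac{\partial b_1}{\partial s} = 0 \\[2ex]
c_2 = b_2 + rc_1 & \dfrac{\partial b_2}{\partial r} = b_1 = c_1 & \dfrac{\partial b_2}{\partial s} = 0 \\[2ex]
c_3 = b_3 + rc_2 + sc_1 & \dfrac{\partial b_3}{\partial r} = b_2 + r\dfrac{\partial b_2}{\partial r} = b_2 + rc_1 = c_2 & \dfrac{\partial b_3}{\partial s} = b_1 = c_1 \\[2ex]
c_4 = b_4 + rc_3 + sc_2 & \dfrac{\partial b_4}{\partial r} = b_3 + r\dfrac{\partial b_3}{\partial r} + s\dfrac{\partial b_2}{\partial r} = b_3 + rc_2 + sc_1 = c_3 & \dfrac{\partial b_4}{\partial s} = b_2 + r\dfrac{\partial b_3}{\partial s} = b_2 + rc_1 = c_2 \\[2ex]
\quad\vdots & \quad\vdots & \quad\vdots \\[1ex]
c_n = b_n + rc_{n-1} + sc_{n-2} & \dfrac{\partial b_n}{\partial r} = b_{n-1} + rc_{n-2} + sc_{n-3} = c_{n-1} & \dfrac{\partial b_n}{\partial s} = b_{n-2} + rc_{n-3} + sc_{n-4} = c_{n-2} \\[2ex]
 & \dfrac{\partial b_{n+1}}{\partial r} = b_n + rc_{n-1} + sc_{n-2} = c_n & \dfrac{\partial b_{n+1}}{\partial s} = b_{n-1} + rc_{n-2} + sc_{n-3} = c_{n-1}
\end{array}$$

Hence the partial derivatives that we need are equal to the properly corresponding c's.
Our simultaneous equations become, where $\Delta r$ and $\Delta s$ are unknowns to be solved for,

$$-b_n = c_{n-1}\Delta r + c_{n-2}\Delta s,$$

$$-b_{n+1} = c_n\Delta r + c_{n-1}\Delta s.$$

We express the solution as ratios of determinants:

$$\Delta r = \frac{\begin{vmatrix} -b_n & c_{n-2} \\ -b_{n+1} & c_{n-1} \end{vmatrix}}{\begin{vmatrix} c_{n-1} & c_{n-2} \\ c_n & c_{n-1} \end{vmatrix}}, \qquad
\Delta s = \frac{\begin{vmatrix} c_{n-1} & -b_n \\ c_n & -b_{n+1} \end{vmatrix}}{\begin{vmatrix} c_{n-1} & c_{n-2} \\ c_n & c_{n-1} \end{vmatrix}}.$$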


A statement of the algorithm for Bairstow’s method is given in Section 1.15. 


EXAMPLE Find the quadratic factors of

$$x^4 - 1.1x^3 + 2.3x^2 + 0.5x + 3.3 = 0.$$

Use $x^2 + x + 1$ as starting factor ($r = -1$, $s = -1$). (Frequently $r = s = 0$ are used
as starting values if no information as to an approximate factor is known.) Equations
(1.1) lead to a double synthetic-division scheme as follows:

                a_1     a_2     a_3     a_4     a_5
                 1     -1.1     2.3     0.5     3.3
    r = -1             -1.0     2.1    -3.4     0.8
    s = -1              --     -1.0     2.1    -3.4
                 1     -2.1     3.4    -0.8     0.7
                                       (b_n)  (b_{n+1})
    r = -1             -1.0     3.1    -5.5
    s = -1              --     -1.0     3.1
                 1     -3.1     5.5    -3.2
                     (c_{n-2}) (c_{n-1}) (c_n)

Note that the equations for $b_2$ and $c_2$ have no term involving $s$. The dashes in the preceding
tableau represent these missing factors. Then

$$\Delta r = \frac{\begin{vmatrix} 0.8 & -3.1 \\ -0.7 & 5.5 \end{vmatrix}}{\begin{vmatrix} 5.5 & -3.1 \\ -3.2 & 5.5 \end{vmatrix}} = \frac{2.23}{20.33} = 0.11, \qquad r^* = -1 + 0.11 = -0.89,$$

$$\Delta s = \frac{\begin{vmatrix} 5.5 & 0.8 \\ -3.2 & -0.7 \end{vmatrix}}{20.33} = \frac{-1.29}{20.33} = -0.06, \qquad s^* = -1 - 0.06 = -1.06.$$

The second trial yields

                     1     -1.1     2.3     0.5     3.3
    r = -0.89              -0.89    1.77   -2.68    0.06
    s = -1.06               --     -1.06    2.11   -3.19
                     1     -1.99    3.01   -0.07    0.17
    r = -0.89              -0.89    2.56   -4.01
    s = -1.06               --     -1.06    3.05
                     1     -2.88    4.51   -1.03

$$\Delta r = \frac{\begin{vmatrix} 0.07 & -2.88 \\ -0.17 & 4.51 \end{vmatrix}}{\begin{vmatrix} 4.51 & -2.88 \\ -1.03 & 4.51 \end{vmatrix}} = \frac{-0.175}{17.374} = -0.010, \qquad r^* = -0.89 - 0.010 = -0.900,$$

$$\Delta s = \frac{\begin{vmatrix} 4.51 & 0.07 \\ -1.03 & -0.17 \end{vmatrix}}{17.374} = \frac{-0.694}{17.374} = -0.040, \qquad s^* = -1.06 - 0.040 = -1.100.$$

The exact factors are $(x^2 + 0.9x + 1.1)(x^2 - 2x + 3)$.
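One pass of the double synthetic division and the determinant solution can be sketched in Python as below. This is not the algorithm statement referred to in Section 1.15, only an illustration under stated assumptions: the name `bairstow_step` is introduced here, coefficients are listed from the highest power down, and the polynomial is assumed to have degree at least 3 so that $c_{n-2}$ exists.

```python
def bairstow_step(a, r, s):
    """One Bairstow correction for the trial factor x^2 - r x - s.

    a lists a_1, ..., a_{n+1}; returns the improved (r*, s*). Assumes n >= 3.
    """
    n = len(a) - 1
    b = []
    for i, ai in enumerate(a):                 # b_i = a_i + r b_{i-1} + s b_{i-2}
        bi = ai
        if i >= 1:
            bi += r * b[i - 1]
        if i >= 2:
            bi += s * b[i - 2]
        b.append(bi)
    c = []
    for i in range(n):                         # c_i = b_i + r c_{i-1} + s c_{i-2}
        ci = b[i]
        if i >= 1:
            ci += r * c[i - 1]
        if i >= 2:
            ci += s * c[i - 2]
        c.append(ci)
    bn, bn1 = b[n - 1], b[n]                   # b_n, b_{n+1}
    cn2, cn1, cn = c[n - 3], c[n - 2], c[n - 1]
    det = cn1 * cn1 - cn * cn2
    dr = (-bn * cn1 + bn1 * cn2) / det         # ratio of determinants for delta r
    ds = (-cn1 * bn1 + cn * bn) / det          # ratio of determinants for delta s
    return r + dr, s + ds


coeffs = [1.0, -1.1, 2.3, 0.5, 3.3]
r1, s1 = bairstow_step(coeffs, -1.0, -1.0)     # first trial of the example
r, s = -1.0, -1.0
for _ in range(10):                            # a fixed number of passes, for illustration
    r, s = bairstow_step(coeffs, r, s)
print(round(r1, 2), round(s1, 2), r, s)
```

A practical program would stop when $|\Delta r|$ and $|\Delta s|$ fall below a tolerance instead of running a fixed number of passes.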


1.10 OTHER METHODS FOR POLYNOMIALS 


In this section we discuss two methods that do not seem to be widely used but that have 
the special advantage of not requiring a reasonably good starting value. We first discuss 
the QD algorithm, then mention Graeffe’s method. 

A relatively efficient method to determine all the roots of a polynomial without starting 
values is the QD or quotient-difference algorithm. We present the method without 
elaboration.* 


*Henrici (1964) discusses the method in some detail. 
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For the nth-degree polynomial 
P(x) = yx” + ax"! + 0+ + + gx + Ane 


we form an array of q and e terms, starting the tableau by calculating a first row of q's 
and a second row of e’s: 


g? = -a3/ay, all other q's are zero, 
9 = aisa/diny, = 1,2,...,0- 1, 
eo = eM = 90, 


The start of the array is 


2 2 = 
2 = gD eg gq? vee Dd gr) em 


ay 


ay 


0 Q ove. 0 
ay a4 
ay ay ay 
A new row of q’s is computed by the equation 
New g!) = el) = ef + gi, 


using terms from the e and q rows just above, Note that this algorithm is “e to right 
minus e to left plus q above.” 
A new row of e’s is now computed by the equation 


(itd) 
New e) = (Sr Je, 


“q to right over q to left times e above.” The example in Table 1.7 isolates the roots of 
the quartic 


P4(x) = 128x* — 256x3 + 160x? — 32x + 1 


by continuing to compute rows of q’s and then e’s until all the e-values approach zero. 
When this occurs, the q-values assume the values of the roots, Since the method is slow 
to converge, it is generally used only to get approximate values, which are then improved 
by Newton’s method. 
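The two rules above can be tried out with a short Python sketch. It is only an illustration under stated assumptions: the function name `qd_table` and its arguments are hypothetical, the coefficients are given highest power first, and every coefficient must be nonzero so that the starting rows exist.

```python
def qd_table(coeffs, sweeps):
    """Return the q and e rows after `sweeps` QD sweeps.

    coeffs = [a1, a2, ..., a(n+1)] for a1*x**n + ... + a(n+1).
    Index i of the returned q list holds q(i+1); index j of e holds e(j).
    """
    n = len(coeffs) - 1
    if any(a == 0 for a in coeffs):
        raise ValueError("the starting rows need all coefficients nonzero")
    # First row of q's: q(1) = -a2/a1, all other q's are zero.
    q = [-coeffs[1] / coeffs[0]] + [0.0] * (n - 1)
    # First row of e's: e(0) = e(n) = 0 and e(i) = a(i+2)/a(i+1).
    e = [0.0] + [coeffs[i + 1] / coeffs[i] for i in range(1, n)] + [0.0]
    for _ in range(sweeps):
        # New q(i) = e(i) - e(i-1) + q(i): e to right minus e to left plus q above.
        q = [e[i + 1] - e[i] + q[i] for i in range(n)]
        # New e(i) = (q(i+1)/q(i)) * e(i): q to right over q to left times e above.
        e = [0.0] + [q[i + 1] / q[i] * e[i + 1] for i in range(n - 1)] + [0.0]
    return q, e


if __name__ == "__main__":
    quartic = [128.0, -256.0, 160.0, -32.0, 1.0]
    for sweeps in (1, 10, 100):
        q, e = qd_table(quartic, sweeps)
        print(sweeps, [round(v, 5) for v in q])
```

With one sweep the sketch reproduces the second q row of Table 1.7; many more sweeps are needed before the q-values settle, which is why the text suggests polishing them with Newton's method.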

If the polynomial has a pair of conjugate complex roots, one of the e's will not
approach zero but will fluctuate in value. The sum of the two q-values on either side of
this e will approach r, and the product of the q above and to the left times the q below
and to the right approaches −s, in the factor $x^2 - rx - s$. Two equal roots behave similarly.

Table 1.8 shows the result of the method for the polynomial 


\[ (x - 1)(x - 4)(x^2 - x + 3) = x^4 - 6x^3 + 12x^2 - 19x + 12. \]
Table 1.7 Example of QD method for $P(x) = 128x^4 - 256x^3 + 160x^2 - 32x + 1$

     e(0)     q(1)     e(1)     q(2)     e(2)     q(3)     e(3)     q(4)     e(4)

             2.000                 0                 0                 0
        0            -0.625            -0.200            -0.031                 0
             1.375             0.425             0.169             0.031
        0            -0.193            -0.079            -0.006                 0
             1.182             0.539             0.242             0.037
        0            -0.088            -0.036            -0.001                 0
             1.094             0.591             0.277             0.038
        0            -0.048            -0.017            -0.000                 0
             1.046             0.622             0.294             0.038
        0            -0.028            -0.008            -0.000                 0
             1.018             0.642             0.302             0.038
        0            -0.018            -0.004            -0.000                 0
             1.000             0.656             0.304             0.038
        0            -0.012            -0.002            -0.000                 0
             0.988             0.666             0.306             0.038
        0            -0.008            -0.001            -0.000                 0
             0.980             0.673             0.307             0.038
        0            -0.005            -0.001            -0.000                 0
             0.975             0.677             0.308             0.038*


*The true values of the roots are 0.96194, 0.69134, 0.30866, and 0.03806.


Factors are $(x - 4)(x - 1)(x^2 - x + 3)$.


$q^{(1)}$ converging to 4.
$q^{(4)}$ converging to 1.


Since $e^{(2)}$ does not approach zero, $q^{(2)}$ and $q^{(3)}$ represent a quadratic factor:

\[ r = q^{(2)} + q^{(3)} = 1.456 - 0.466 = 0.990; \]
\[ s = -(-6.426)(-0.466) = -2.995. \]

This quadratic factor is $x^2 - rx - s = x^2 - 0.990x - (-2.995)$.

Note that one cannot compute the first q and e rows if one of the coefficients in the 
polynomial is zero, for division by zero is undefined. In such a case, we change the 
variable to y = x — 1. (Subtracting 1 from the roots of the equation is an arbitrary choice, 
but this facilitates the reverse change of variable to get the roots of the original equation 
after the roots of the new equation in y have been found.) 




Table 1.8 QD method with complex roots, for $P(x) = x^4 - 6x^3 + 12x^2 - 19x + 12$

     e(0)     q(1)     e(1)     q(2)     e(2)     q(3)     e(3)     q(4)     e(4)

             6.000                 0                 0                 0
        0            -2.000            -1.583            -0.632                 0
             4.000             0.417             0.951             0.632
        0            -0.208            -3.610            -0.420                 0
             3.792            -2.985             4.141             1.052
        0             0.164             5.008            -0.107                 0
             3.956             1.859            -0.974             1.159
        0             0.077            -2.624             0.127                 0
             4.033            -0.842             1.777             1.032
        0            -0.016             5.538             0.074                 0
             4.017             4.712            -3.687             0.958
        0            -0.019            -4.333            -0.019                 0
             3.998             0.398             0.627             0.977
        0            -0.002            -6.826            -0.030                 0
             4.000            -6.426             7.423             1.007
        0             0.003             7.885            -0.004                 0
             4.003             1.456            -0.466             1.010

For example, if $f(x) = x^4 - 2x^2 + x - 1 = 0$, we let y = x − 1 and use repeated
synthetic division to determine the coefficients of f(y) = 0. The successive remainders
on dividing by x − 1 (shown in parentheses) are the coefficients of f(y):

      1     0    -2     1    -1   | 1
            1     1    -1     0
      1     1    -1     0   (-1)
            1     2     1
      1     2     1    (1)
            1     3
      1     3    (4)
            1
      1    (4)
     (1)
Therefore,

\[ f(y) = y^4 + 4y^3 + 4y^2 + y - 1. \]

We proceed to find the roots of f(y) = 0, and then get the roots of f(x) = 0 by
adding 1.
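The change of variable by repeated synthetic division can be sketched in a few lines of Python. The function name `shift_polynomial` is hypothetical, and coefficients are assumed to be listed highest power first.

```python
def shift_polynomial(coeffs, c):
    """Coefficients of f(y + c), given those of f(x), highest power first.

    Each pass is one synthetic division by (x - c); the remainder of
    each pass is the next coefficient of the shifted polynomial,
    starting from the constant term.
    """
    work = list(coeffs)
    shifted_low_first = []
    for _ in range(len(coeffs)):
        for i in range(1, len(work)):
            work[i] += c * work[i - 1]
        shifted_low_first.append(work.pop())  # remainder of this division
    return shifted_low_first[::-1]


if __name__ == "__main__":
    # f(x) = x^4 - 2x^2 + x - 1 with y = x - 1
    print(shift_polynomial([1, 0, -2, 1, -1], 1))
```

For the example in the text this returns the coefficients 1, 4, 4, 1, −1, none of which is zero, so the QD starting rows can now be formed.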

Graeffe’s method finds values for all the roots of a polynomial directly from its 
coefficients without requiring starting values. It is based on the fact that if the roots are 
all different and widely separated, then for the polynomial 


\[ P(x) = a_1 x^n + a_2 x^{n-1} + \cdots + a_n x + a_{n+1}, \]

the roots are given by

\[ r_1 \doteq -\frac{a_2}{a_1}, \quad r_2 \doteq -\frac{a_3}{a_2}, \quad \ldots, \quad r_n \doteq -\frac{a_{n+1}}{a_n}. \qquad (1.2)^* \]

In order to separate the roots of the given polynomial, it is converted to another
polynomial whose roots are the negative squares of the original roots.† After enough
repetitions of the root-squaring operation the relations of Eq. (1.2) give the values of the
roots, provided no multiple roots occur.
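The text does not spell out the root-squaring step, so the following Python sketch fills it in with the usual coefficient formula, b_k = a_k² − 2a_{k−1}a_{k+1} + 2a_{k−2}a_{k+2} − ···, which yields a polynomial whose roots are the negative squares of the original roots. The function names are hypothetical, and the sketch assumes real roots of distinct magnitude; it recovers magnitudes only, so signs must be checked separately by substitution.

```python
def graeffe_step(a):
    """One root-squaring step; a = [a1, ..., a(n+1)], highest power first."""
    n = len(a)
    b = []
    for k in range(n):
        total = a[k] * a[k]
        sign, j = -1, 1
        # Add the cross products of coefficients placed symmetrically about a[k].
        while k - j >= 0 and k + j < n:
            total += 2 * sign * a[k - j] * a[k + j]
            sign = -sign
            j += 1
        b.append(total)
    return b


def graeffe_magnitudes(a, steps):
    """Approximate |r1| > |r2| > ... after `steps` squarings, using Eq. (1.2)."""
    b = list(a)
    for _ in range(steps):
        b = graeffe_step(b)
    power = 2 ** steps
    return [abs(b[i + 1] / b[i]) ** (1.0 / power) for i in range(len(b) - 1)]


if __name__ == "__main__":
    cubic = [1, -6, 11, -6]            # (x - 1)(x - 2)(x - 3)
    print(graeffe_step(cubic))          # coefficients of (y + 1)(y + 4)(y + 9)
    print(graeffe_magnitudes(cubic, 5))
```

Integer coefficients are used here so that Python's exact integers avoid overflow; with floating-point coefficients the squared values grow very quickly.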


1.11 COMPUTER ARITHMETIC AND ERRORS


Up to now we have not anticipated that you would use computer programs to solve the 
exercises. Many of the methods presented in this and the following chapters can readily 
be done with a calculator. The reason we have avoided a discussion of computers so far 
was to let you concentrate on the algorithms. However, in your professional careers you 
may need to solve large problems quickly, handle a large amount of data, or use certain 
methods frequently. To carry this out, a computer is essential. To use one, you will either 
write your own programs to implement the methods or make use of some of the excellent 
software already available.* In this section and the next we examine the
limitations of the computer and discuss programming languages you could use.
A person with a solid knowledge of mathematics, of a programming language, and of 
the computer will find numerical analysis a powerful asset in solving real-world problems. 
Many programs are presented in this book. These are meant mainly to show the 
implementation of the algorithms and are not truly professional codes. They should be 
useful to you as easily understood programming examples as well as in solving the 
exercises if you do not have access to libraries of numerical analysis subroutines. 


*We use the notation $r_1 \doteq -a_2/a_1$ to indicate approximate equality of the two quantities.

†See Scarborough (1950) for more information.

* Appendix C discusses some software that is generally available, both for personal computers and larger multiuser 
systems. 
We have previously observed that it is essential not to neglect the study of the errors
of our numerical techniques. For several of our methods, we have analyzed how the 
errors of successive iterations decrease (or increase if the method diverges). It is time 
that we look more deeply into the various sources of error that interact to affect the 
accuracy of our result. We can list these sources as: 


TRUNCATION ERROR 


This term is given to the errors caused by the method itself (the term originates from the 
fact that numerical methods can usually be compared to a truncated Taylor series) and is 
the error to which we have so far paid most attention. For instance, we may approximate
$e^x$ by the cubic

\[ p_3(x) = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!}. \]

However, we know that to compute $e^x$ really requires an infinitely long series:

\[ e^x = p_3(x) + \sum_{n=4}^{\infty} \frac{x^n}{n!}. \]

Note that an approximation of $e^x$ with the cubic gives an inexact answer. The error is
due to truncating the series and has nothing to do with a computer or calculator. For
iterative methods, this error usually can be reduced by repeated iterations, but since life
is finite and computer time is expensive, we must be satisfied with approximations to the
exact analytical answer.
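A small Python sketch makes the size of this truncation error concrete. The function names are illustrative only, and the library `math.exp` stands in for the exact value.

```python
import math


def p3(x):
    """The cubic 1 + x + x^2/2! + x^3/3! used above to approximate e^x."""
    return 1.0 + x + x**2 / 2.0 + x**3 / 6.0


def truncation_error(x):
    """Difference between the library value of e^x and the cubic."""
    return math.exp(x) - p3(x)


if __name__ == "__main__":
    for x in (0.1, 0.5, 1.0):
        # The first neglected term, x^4/4!, dominates the error for small x.
        print(f"x = {x:3.1f}  error = {truncation_error(x):.3e}  x^4/4! = {x**4 / 24:.3e}")
```

The error shrinks rapidly as x decreases, which is consistent with the first neglected term being proportional to x⁴.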


ROUND-OFF ERROR 


All computing devices represent numbers with some imprecision. Digital computers,
which are the normal devices for implementing numerical methods, will nearly always
use floating-point numbers with a fixed word length. The true values are not exactly
expressed by such representations. We call this error round-off, whether the decimal
fraction is rounded or chopped after the final digit. We discuss this in more detail below.


ERROR IN ORIGINAL DATA 


Real-world problems, in which an existing or proposed physical situation is modeled by
a mathematical equation, frequently have coefficients that are imperfectly known. The
model itself may not perfectly reflect the behavior of the situation either. The numerical
analyst can do nothing to overcome such errors by any choice of method, but he or she
needs to be aware of such uncertainties; in particular, one may need to perform tests to
see how sensitive the results are to changes in the input information. Since the reason
for performing the computation is to permit some decision with validity in the real world,
sensitivity analysis is of extreme importance. As Hamming says, “the purpose of
computing is insight, not numbers.”
BLUNDERS


It is anticipated that you will use a digital computer (or at least a programmable calculator)
in your professional use of numerical analysis. You will probably use such computing
tools extensively while you are learning the topics covered in this text. Such machines
make mistakes only very infrequently, but since humans are involved in programming,
operation, preparing the input, and interpreting the output, blunders or gross errors do
occur more frequently than we like to admit. The solution here is care, coupled with a
careful examination of the results for reasonableness. Sometimes a test run with known
results is worthwhile, but this is no guarantee of freedom from foolish error. When hand
computation was more common, check sums were frequently computed; these would
ordinarily reveal the mistake and permit its correction.


PROPAGATED ERROR 


This is more subtle than the other errors. By propagated error we mean the error in the
succeeding steps of the process due to the occurrence of an earlier error. This is in addition
to the local error made at that step; it is somewhat analogous to errors in the initial
conditions. The methods discussed in this chapter do not reflect this type of error, except
in the case of finding additional zeros of a function using the reduced or deflated equation.
Here the reduced equations reflect errors in the previous stages. In other examples of
numerical methods treated in later chapters, propagated error is of critical importance. If
errors are magnified continuously as the method continues, eventually they will completely
overshadow the true value, destroying its validity; we call such a method unstable. (For
a stable method, the desirable kind, errors made at early points die out as the method
continues. This will be covered more thoroughly in later chapters.)


Each of these types of error, while interacting to a degree, may occur even in the
absence of the other kinds. For example, round-off error will occur even if truncation
error is absent, as in an analytical method. Likewise, truncation errors would cause
inaccuracies even if one could attain perfect precision in the calculations. The usual error
analysis of a numerical method treats the truncation error as if such perfect precision did
exist.


1.12 FLOATING-POINT ARITHMETIC AND ERROR ESTIMATES 


To examine round-off error in detail, we need to understand how numeric quantities are
represented in computers. In nearly all cases, they are stored as floating-point quantities,
which are very much like scientific notation.* Different computers use slightly different
techniques, but the general procedure is similar.


* Another name often used for floating-point number is real number, but we here reserve the term real for the
continuous (and infinite) set of numbers on the “number line.” When printed as a number with a decimal point,
it is called fixed-point. The essential concept is that these are in contrast to integers.


EXAMPLE 
We can represent a floating-point number in the general form

\[ \pm .d_1 d_2 \ldots d_p \times B^e, \qquad (1.3) \]

where the $d_i$'s are digits or bits with $0 \le d_i \le B - 1$, and

B: the number base being used (usually either 2, 16, or 10)
p: the number of significant digits (bits), that is, the precision
e: the integer exponent, with a range of values defined on the interval
   $[e_{\min}, e_{\max}]$ (ordinarily the range will include negative as well as positive values)
f: $.d_1 d_2 \ldots d_p$, the fractional part of the number


For hand calculators the base B is usually 10. In computers this base is often 2, but 
other bases, such as 16, are also used. 
Suppose B = 10 and p = 4. Then these numbers would be represented as

\[
\begin{aligned}
27.39 &\to +.2739 \times 10^{2}; \\
-0.00124 &\to -.1240 \times 10^{-2}; \\
37000 &\to +.3700 \times 10^{5}.
\end{aligned}
\]


In this example we have required that the first element in f satisfy $d_1 \ne 0$. When this is true
for a floating-point number system, we refer to its members as normalized floating-point numbers.
Moreover, although the number of reals on any finite interval is infinite, this is not true
for floating-point numbers. It turns out that the actual number of normalized
floating-point numbers represented in (1.3) is

\[ 2 (B - 1) B^{p-1} (e_{\max} - e_{\min} + 1) + 1, \qquad (1.4) \]

where the last term, +1, is for 0.0. In computers, a widely used floating-point number
system is the IEEE standard. Two levels of precision are defined:


Single-precision format (32 bits long):

    1 bit    8 bits     23 bits
      ±        e        d_1 d_2 ... d_p

    B = 2, e in [−128, 127], p = 23

Double-precision format (64 bits long):

    1 bit    11 bits    52 bits
      ±        e        d_1 d_2 ... d_p

    B = 2, e in [−1024, 1023], p = 52


The first bit represents the sign of the number. The sign of the exponent is handled by
biasing the exponent value, that is, adding 128 (or 1024) to it so that unsigned values
from 0 to 255 (2047) are actually stored. From (1.4), the number of normalized
floating-point numbers in these formats is

Single format: $2 \cdot 1 \cdot 2^{22} \cdot (2^7 - (-2^7)) + 1 \approx 2.14 \times 10^{9}$
Double format: $2 \cdot 1 \cdot 2^{51} \cdot (2^{10} - (-2^{10})) + 1 \approx 9.22 \times 10^{18}$
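The stored fields can be inspected directly with Python's standard `struct` module. This is a sketch with hypothetical function names, and it shows the formats as implemented on current hardware, which differ in detail from the simplified description above: IEEE 754 uses a bias of 127 (1023 in double precision), writes the significand as 1.f with the leading 1 not stored, and reserves the smallest and largest exponent codes for zero, subnormals, infinities, and NaNs.

```python
import struct


def single_fields(x):
    """Return (sign bit, stored exponent, stored fraction) of x as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def double_fields(x):
    """Return (sign bit, stored exponent, stored fraction) of x as a 64-bit float."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


if __name__ == "__main__":
    for value in (1.0, -0.15625, 0.1):
        sign, exponent, fraction = single_fields(value)
        print(f"{value:>9}: sign={sign} exponent={exponent} fraction={fraction:023b}")
```

The field widths (1, 8, and 23 bits; 1, 11, and 52 bits) match the layouts shown above; only the exponent convention differs.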


There are other methods for storing floating-point numbers. In large IBM computers, 
for example, B = 16, with 24 bits (6 hex digits) for the fraction part and 7 bits (biased 
64) for the binary exponent. Control Data CYBER computers have 60-bit words, 48 bits 
for the fraction part and 11 for exponent with B = 2. Some computers with B = 2 gain
an extra bit of precision by not storing $d_1$, which is always a 1 for normalized fractions
and so can be omitted from the stored numbers. It is then referred to as the hidden bit.

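We now examine the question of the distribution of normalized floating-point numbers.
Here we consider the case where B = 2, p = 2, and $-2 \le e \le 3$. In this example,
all the normalized numbers would be of the form

\[ \pm .10_2 \times 2^e \quad \text{or} \quad \pm .11_2 \times 2^e, \qquad -2 \le e \le 3. \]

Since for binary fractions, $.10_2 = \tfrac{1}{2}$ and $.11_2 = \tfrac{1}{2} + \tfrac{1}{4} = \tfrac{3}{4}$, these numbers range from
−6 to +6 ($.11_2 \times 2^3 = 6$ and $-.11_2 \times 2^3 = -6$). A list of all the positive numbers in this
system is

    .10 × 2^-2 = 1/8        .11 × 2^-2 = 3/16
    .10 × 2^-1 = 1/4        .11 × 2^-1 = 3/8
    .10 × 2^0  = 1/2        .11 × 2^0  = 3/4
    .10 × 2^1  = 1          .11 × 2^1  = 3/2
    .10 × 2^2  = 2          .11 × 2^2  = 3
    .10 × 2^3  = 4          .11 × 2^3  = 6

In the diagram, we see the distribution of all of the nonnegative floating-point numbers
on the interval [0, 6].

[Diagram: number line from 0 to 6 marking the nonnegative floating-point numbers of this system.]

There are 25 total numbers in this tiny floating-point system, exactly as given by the
formula

\[ 2 (2 - 1) \, 2^{2-1} \, (3 - (-2) + 1) + 1 = 2 \cdot 1 \cdot 2 \cdot 6 + 1 = 25. \]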
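The count can be checked by brute force. The following Python sketch (the function name `normalized_numbers` is hypothetical) enumerates every normalized value exactly with rational arithmetic and compares the total with formula (1.4).

```python
from fractions import Fraction


def normalized_numbers(base, p, e_min, e_max):
    """All values ±.d1...dp × base**e with d1 != 0 and e_min <= e <= e_max, plus zero."""
    values = {Fraction(0)}
    for e in range(e_min, e_max + 1):
        # Integer significands whose leading digit is nonzero.
        for m in range(base ** (p - 1), base ** p):
            v = Fraction(m, base ** p) * Fraction(base) ** e
            values.update((v, -v))
    return sorted(values)


if __name__ == "__main__":
    tiny = normalized_numbers(2, 2, -2, 3)
    print(len(tiny))
    print([str(v) for v in tiny if v > 0])
```

Printing the positive values shows the uneven spacing directly: the gap between neighbors doubles each time the exponent increases by one.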


It is hard for us to think in number bases other than 10, so we will discuss the
arithmetic accuracy of floating-point operations using normalized base-10 numbers. The
other number bases that are actually used in computers behave analogously. To simplify
our treatment, assume only three digits in the fraction part, and one decimal digit for the
exponent. We supply signs for the fraction and the exponent as separate symbols. For
these examples, B = 10, p = 3, e in [−9, 9]. We compare rounding and chopping the
results.

When two floating-point numbers are added or subtracted, the digits in the fraction 
of the number with the smaller exponent must be shifted to align the decimal points; the 
sum may need to be shifted and the exponent adjusted to normalize the result. For example, 


1.  .137 × 10^1 + .269 × 10^-1:

         .137   × 10^1
       + .00269 × 10^1       Align decimal points
         .13969 × 10^1   →   Chopped = .139 × 10^1
       + .0005
         .14019 × 10^1   →   Rounded* = .140 × 10^1


2.  .485 × 10^4 − .482 × 10^4:

         .485 × 10^4
       − .482 × 10^4
         .003 × 10^4
         .300 × 10^2     Normalized   →   Chopped = .300 × 10^2
       + .0005
         .3005 × 10^2    →   Rounded = .300 × 10^2


3.  .378 × 10^4 + .727 × 10^4:

         .378 × 10^4
       + .727 × 10^4
        1.105 × 10^4
         .1105 × 10^5    Normalized   →   Chopped = .110 × 10^5
       + .0005
         .1110 × 10^5    →   Rounded = .111 × 10^5


Observe that rounding requires an extra operation, which can be done with either
hardware or software. Either way adds cost, so most computers chop rather than round.
Pay particular attention to the loss of precision in the second example. There is only one
digit of accuracy in the result, although the difference is represented as if the trailing
zeros were significant. This loss of significance when two nearly equal numbers are
subtracted is the major reason for the loss of accuracy in floating-point operations.†


*To round to three digits after the decimal point, we add 0.0005 and then chop the digits beyond three.

†One way to detect such insignificant zeros is to provide for changing the entering digit on a left shift of the
fraction during normalizing. Two runs with different fill digits will then give different results. This is pretty
costly. Another way (also costly) is to repeat the computer run using double precision, and compare the results.
Multiplication of two n-digit numbers gives a result of 2n digits. The floating-point
registers of a computer* normally allow for this, converting the 2n-digit product to an
n-digit result. Such double-length registers are also used in division. Some examples:


4.  (.403 × 10^6) × (.197 × 10^-1):

       Multiply fractions       Add exponents
            .403                      6
          × .197                     -1
            .079391                   5

         .079391 × 10^5
         .79391  × 10^4    Normalized   →   Chopped = .793 × 10^4
       + .0005
         .79441  × 10^4    →   Rounded = .794 × 10^4


5.  (.356 × 10^-2) ÷ (.156 × 10^4):

       Divide fractions         Subtract exponents
            .356                     -2
          ÷ .156                      4
           2.28205                   -6

        2.28205 × 10^-6
         .22820 × 10^-5    Normalized   →   Chopped = .228 × 10^-5
       + .0005
         .22870 × 10^-5    →   Rounded = .228 × 10^-5


In multiplication and division, the initial shifting to align decimal points can be 
omitted. The multiplication step is usually considerably slower than addition; the overall 
time for floating-point multiplication is typically from 2.5 to 10 times that for floating- 
point addition or subtraction. Floating-point division is usually the slowest of all (4 to 
25 times that for addition). These timing differences are even greater when software 
routines are required for multiplication and division. 
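The five worked examples above can be replayed with Python's standard `decimal` module, which lets the precision and rounding rule be set explicitly. This is a sketch rather than a model of any particular machine: the names `CHOP`, `ROUND`, and `both` are hypothetical, and the exponent range is left at the module's default instead of being limited to [−9, 9].

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP

# Three significant decimal digits, as in the simplified system of the text.
CHOP = Context(prec=3, rounding=ROUND_DOWN)       # discard the extra digits
ROUND = Context(prec=3, rounding=ROUND_HALF_UP)   # add half a unit, then discard


def both(op, a, b):
    """Apply add, subtract, multiply, or divide under chopping and under rounding."""
    x, y = Decimal(a), Decimal(b)
    return getattr(CHOP, op)(x, y), getattr(ROUND, op)(x, y)


if __name__ == "__main__":
    cases = [
        ("add", "1.37", "0.0269"),          # example 1
        ("subtract", "4850", "4820"),       # example 2
        ("add", "3780", "7270"),            # example 3
        ("multiply", "403000", "0.0197"),   # example 4
        ("divide", "0.00356", "1560"),      # example 5
    ]
    for op, a, b in cases:
        chopped, rounded = both(op, a, b)
        print(f"{op:8s} {a:>8s} {b:>8s}  chopped = {chopped}  rounded = {rounded}")
```

Each operation is carried out exactly and then cut back to three digits, which mirrors the use of a double-length register followed by chopping or rounding.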

Many computers provide for two (or even three) levels of precision by using a double- 
length word for double precision. Usually the number of bits used for the exponent 
remains the same, so the number of digits in the fraction more than doubles.†

Since decimal numbers are usually converted to a different number base when stored 
as a floating-point value, the number of significant decimal digits equivalent to the 
accuracy of the fraction is not an integer. Different computer manufacturers have 
approached the problem of floating-point representation with wide variations of word 
lengths, so the accuracy provided is considerably different (from 6 to 16 equivalent decimal 
digits in single precision, for example). 


*Not all computers have hardware to multiply or divide floating-point numbers. This is particularly true for 
microprocessors. In such cases, software routines are used. 

†Double precision therefore requires twice the memory space to store the numbers. The execution time is also
increased, from 25% to several fold.
Converting numbers to the computer's internal number base will often introduce some
error. Terminating decimal fractions may be nonterminating in binary
[$(0.6)_{10} = (0.100110011001\ldots)_2$, for example]. In addition, representing floating-point numbers
in a fixed finite word length has many “gaps” in its number set, as we have seen. In
effect, we have to map the infinite set of mathematical reals into a finite set of computer
numbers. For our simplified example of three-digit fractions, there are only 900 different
fraction values; all the mathematical numbers between 0.1 and 1.0 must be translated
to one of these 900 values. In each decade, as represented by a constant value of the
exponent, there also are only 900 different values. The spacing between values in the
different decades is therefore different.

Zero is a special situation among the floating-point numbers. It cannot be normalized,
of course, so special conventions are adopted. In most systems, the fraction digits are
all zeros. The exponent must also be set to the most negative value; if this is not done,
alignment of decimal points when adding will shift out significant digits from the addend.
Zero is relatively isolated from the other values. In our simplified example system, the
nearest neighbors to zero are $\pm 0.1 \times 10^{-9}$. Trying to represent any magnitudes smaller
than this causes a program error called exponent underflow. In some FORTRAN systems,
on underflow the number is replaced by floating-point zero and execution continues. This
may produce other errors if subsequent division is performed. Similarly, an attempt to
represent numbers of magnitude larger than $0.999 \times 10^{9}$ in our system gives rise to
exponent overflow. Normally, this terminates execution, but some systems replace the
number with the largest possible floating-point quantity and then continue.

Peculiar things happen in floating-point arithmetic. For example, adding 0.001 one 
thousand times usually does not equal 1.0 exactly. In some instances, multiplying a 
number by unity does not reproduce the number. In many calculations, changing the 
order in which operations are performed will produce different results. 
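These effects are easy to observe. The sketch below uses Python's double-precision floats (an assumption: the text's remarks apply to any floating-point system, and the exact digits printed depend on the format). The helper name `repeated_sum` is hypothetical; `math.fsum` is the standard-library routine that tracks the lost low-order digits.

```python
import math


def repeated_sum(term, count):
    """Add `term` to a running total `count` times, rounding after every addition."""
    total = 0.0
    for _ in range(count):
        total += term
    return total


if __name__ == "__main__":
    naive = repeated_sum(0.001, 1000)
    careful = math.fsum([0.001] * 1000)
    print("repeated addition:", repr(naive), " equals 1.0?", naive == 1.0)
    print("math.fsum:        ", repr(careful), " equals 1.0?", careful == 1.0)

    # Changing the order of operations can change the result.
    left = (0.1 + 0.2) + 0.3
    right = 0.1 + (0.2 + 0.3)
    print(repr(left), repr(right), left == right)
```

The underlying cause is the one described earlier: 0.001 and 0.1 are nonterminating fractions in binary, so each stored value and each intermediate sum carries a small round-off error.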


ABSOLUTE VERSUS RELATIVE ERROR, SIGNIFICANT DIGITS 


We introduced the terms absolute and relative error informally in Section 1.2. More 
formally, the absolute error of a given result is usually defined as 


Absolute error = true value — approximate value 


so we get the true value by adding the absolute error to the approximation. However, a
given error is usually much more serious when the magnitude of the true value is small.
For example, 1036.52 ± 0.010 is accurate to five significant digits and is frequently of
more than adequate precision, while 0.005 ± 0.010 is a clear disaster. The relative error,

\[ \text{Relative error} = \frac{\text{true value} - \text{approximate value}}{\text{true value}}, \]

is often a better indicator of the accuracy. It is more nearly scale-independent, a most
desirable property. When the true value is zero, the relative error is undefined. It follows
that the round-off error due to finite fraction length in floating-point numbers is more
nearly constant when expressed as relative error than when expressed as absolute error.
Observe that the loss of significant digits, when small, nearly equal floating-point numbers
are subtracted, produces a particularly severe relative error. (Note that others define both
absolute error and relative error in terms of absolute values, so that these errors can be
only positive, or zero.)

In addition to the concept of relative and absolute error, we have used the term
significant digits. Suppose we write

1. True value = $d_1 d_2 \ldots d_n d_{n+1} \ldots d_p$, and

2. Approximate value = $d_1 d_2 \ldots d_n e_{n+1} \ldots e_p$,

where $d_1 \ne 0$ and the first difference of the digits is at the (n + 1)st digit. Then we say
that (1) and (2) agree to n significant digits if $|d_{n+1} - e_{n+1}| < 5$. Otherwise, we say
they agree to n − 1 significant digits.


EXAMPLE

Let the true value = 10/3 and the approximate value = 3.333.
Then the absolute error = 0.000333. . . = 1/3000;
the relative error = (1/3000)/(10/3) = 1/10000;
the number of significant digits = 4.
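The definitions can be checked on this example with exact rational arithmetic. The function names below are hypothetical, and `agreeing_digits` is a literal transcription of the digit-comparison rule stated above, applied to digit strings that are assumed to be aligned at the same leading nonzero digit.

```python
from fractions import Fraction


def absolute_error(true, approx):
    """True value minus approximate value (signed, as defined in the text)."""
    return true - approx


def relative_error(true, approx):
    """Absolute error divided by the true value; undefined for a zero true value."""
    if true == 0:
        raise ZeroDivisionError("relative error is undefined when the true value is zero")
    return (true - approx) / true


def agreeing_digits(true_digits, approx_digits):
    """Number of significant digits to which two aligned digit strings agree."""
    for n, (d, e) in enumerate(zip(true_digits, approx_digits)):
        if d != e:
            # First difference is at digit n + 1 (counting from 1).
            return n if abs(int(d) - int(e)) < 5 else n - 1
    return min(len(true_digits), len(approx_digits))


if __name__ == "__main__":
    true, approx = Fraction(10, 3), Fraction("3.333")
    print(absolute_error(true, approx), relative_error(true, approx))
    print(agreeing_digits("33333333", "33330000"))
```

Using `Fraction` keeps the arithmetic exact, so the results match the hand computation without any round-off of their own.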


1.13 PROGRAMMING FOR NUMERICAL SOLUTIONS


In this and in all of the following chapters, we will present fully written computer
programs that implement some of the more important methods of the chapter.
A well-written program can clarify an algorithm as well as provide a ready-to-use
tool for solving problems. We encourage you to improve, modify, or create your own
version of the program.

The programs in this book are almost all in the FORTRAN language, even though
some people prefer to use a language that is more modern in its constructs. The reason
for choosing FORTRAN is that most practitioners of numerical analysis are familiar with
it and, more important, that there are lots of prewritten subroutines that are of professional
quality, thoroughly tested, and optimized. In an appendix, you will find more information
about subroutine libraries.

When one writes a program, it is not enough just to know some programming
language; one must also know about the limitations
of the computer and also about efficiency. Consider the following example:


\[ f(x) = e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots \]
An algorithm for computing this series is

1. Set SUM = 1, N = 0, NEXT_TERM = 1.

2. Save SUM value in OLDSUM.

   a) Increment N by 1.
   b) Compute NEXT_TERM = NEXT_TERM * X/N.
   c) Compute SUM = SUM + NEXT_TERM.
   d) Compare OLDSUM with SUM. If they are different, go back to
      2, ELSE go to 3.

3. Set F = SUM.


(We know that FORTRAN and most other languages have a built-in exp(x) function 
that can get the result. Just bear with us—the example makes some important points.) 
The following is a correct FORTRAN subprogram that implements the algorithm. 


      REAL FUNCTION F(X)
      REAL X
      REAL SUM, OLDSUM, NEXTRM
      INTEGER N
C
C     START THE SERIES WITH ITS FIRST TERM, 1.
C
      SUM = 1.0
      NEXTRM = 1.0
      N = 0
C
C     ADD TERMS UNTIL THE SUM NO LONGER CHANGES.
C
   10 OLDSUM = SUM
      N = N + 1
      NEXTRM = NEXTRM*X/N
      SUM = SUM + NEXTRM
      IF (SUM .NE. OLDSUM) GO TO 10
C
      F = SUM
      END


The program is easy to understand, but there are problems when it is run with values
of x that are of large magnitude. Suppose we use it to compute f(30.4) or, worse yet,
f(−30.4). Computing these values of the function requires an inordinate number of
repetitions of the loop. With x = −30.4, the answer is also entirely incorrect! While the
correct value for f(−30.4), to five significant digits, is 0.00000, the printout of F was
C     SECOND VERSION OF THE CODE FOR
C     COMPUTING F(X) = EXP(X).
C
      REAL FUNCTION F(X)
      REAL X
      REAL SUM, OLDSUM, NEXTRM, FRACT, EXP1, N
      INTEGER M
      PARAMETER (EXP1 = 2.718281828459)
C
C     HERE WE EXTRACT THE FRACTIONAL PART OF X
C
      FRACT = ABS(X - AINT(X))
      SUM = 1.0
      NEXTRM = 1.0
      N = 0.0
C
C     WE COMPUTE THE SERIES AS FAR OUT AS WE NEED TO; THAT IS,
C     WE SEE NO CHANGE IN THE SERIES VALUE WHEN
C     ADDING AN ADDITIONAL TERM.
C
   10 OLDSUM = SUM
      N = N + 1.0
      NEXTRM = NEXTRM*FRACT/N
      SUM = SUM + NEXTRM
      IF (SUM .NE. OLDSUM) GO TO 10
C
C     WE STORE THE INTEGER PART OF X IN M
C
      M = INT( ABS(X) )
      IF (X .GE. 0.0) THEN
         F = SUM * (EXP1**M)
      ELSE
         F = 1.0/(SUM * (EXP1**M) )
      END IF
      RETURN
      END


Figure 1.13 Improved program for e^x in FORTRAN.


2.34872! The result varies depending on the computer that is used. This was from an 
HP-150 computer and 104 iterations were required before the program terminated. 
(Another version, with exactly the same program steps but written in BASIC, gave 
91887.2 after 85 iterations!!!) 


Perhaps you have already recognized why this happened. The successive terms of
the series alternate in sign and are large. As we have previously described, subtraction
of large, nearly equal numbers can generate very large errors. That is precisely what happens here.

We can avoid the alternating-sign problem if we compute f(−30.4) as 1/f(+30.4).
Doing so, we get $1{,}594{,}234{,}671 \times 10^{4}$ for f(+30.4) (exact to 10 digits), and its reciprocal
is the correct value of f(−30.4). However, 77 terms of the Taylor series were needed.
We should consider modification to improve the efficiency. One way to do this is the
following.

Since $f(x) = e^x$, it is true that f(30.4) = f(30) * f(0.4). With x = 0.4, the series
converges rapidly. When x is an integer, we can just raise e to that power and not use a
series at all. This is the strategy that we will use. Figure 1.13 shows code that does it.
There is one other improvement that you should note. We declare N as real rather than
integer to avoid the (hidden) conversion from integer to real that occurred during each
execution of the loop in the previous version.*

When f(30.4) and f(−30.4) were computed with the improved version, only 11 terms
of the series were needed to converge to the result, and it was accurate to about 10 digits.
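The three strategies discussed above (direct series, reciprocal of the positive-argument series, and splitting x into integer and fractional parts) can be compared in a short Python sketch. The function names are hypothetical, and Python floats are double precision, so the specific wrong value and the term counts will differ from those reported in the text; the qualitative behavior is the same.

```python
import math


def exp_series(x):
    """Sum the Taylor series for e**x until adding a term changes nothing.

    Returns the sum and the number of terms added after the leading 1.
    """
    total, term, n = 1.0, 1.0, 0
    while True:
        old = total
        n += 1
        term *= x / n
        total += term
        if total == old:
            return total, n


def exp_split(x):
    """e**x from an integer power of e times the series for the fractional part."""
    whole = int(abs(x))
    frac = abs(x) - whole
    series, n = exp_series(frac)
    value = series * math.e ** whole
    return (value if x >= 0 else 1.0 / value), n


if __name__ == "__main__":
    exact = math.exp(-30.4)
    direct, n_direct = exp_series(-30.4)
    positive, n_pos = exp_series(30.4)
    split, n_split = exp_split(-30.4)
    print("library value      :", exact)
    print("direct series      :", direct, "terms:", n_direct)
    print("1/series(+30.4)    :", 1.0 / positive, "terms:", n_pos)
    print("integer + fraction :", split, "terms:", n_split)
```

Even in double precision the direct series for x = −30.4 is dominated by round-off in its large alternating terms, while the other two forms agree closely with the library value and the split form needs far fewer terms.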

While FORTRAN remains the most widely used language for numerical analysis 
(due largely to the huge libraries of useful subroutines in that language), one may code 
the algorithms in almost any computer language. As an illustration of this, a Pascal 
implementation is given in Figure 1.14. 

The Pascal version is written in Borland’s Turbo Pascal—if you have that compiler 
on your personal computer, you can try it out. (Incidentally, there are Pascal versions of 
all the programs in this book available on a disk. If you are interested in these, see your
instructor.) Since there is no exponentiation operator in Pascal, we have computed the 
integer power of e by repeated multiplications. A little trick is used to cut down the 
number of multiplies by about half. 


1.14 CHAPTER SUMMARY


If you have understood this chapter on solving nonlinear equations, you are now able to 


1. Explain the four steps in problem solving with reference to a typical problem 
situation. 


2. Use these methods to find solutions to f(x) = 0: 


halving the interval 

linear interpolation 
modified linear interpolation 
Newton's method 

Muller's method 


*Some FORTRAN compilers are smart enough to do this for us.
(* Pascal program for computing
   f(x) = exp(x).  This is based
   on the improved algorithm for
   computing the series.          *)

CONST
   exp1 = 2.718281828459;

VAR
   x, f,
   fractional_part_of_x,
   save : REAL;

   number_of_iterations,
   integer_part_of_x : INTEGER;

(* This function computes x**n where n is an integer *)

FUNCTION x_to_nth_power(x : REAL; n : INTEGER) : REAL;
VAR
   i : INTEGER;
   product, square_x : REAL;
BEGIN
   product := 1.0;  i := n;  square_x := x*x;

   WHILE i > 0 DO
      IF ODD(i) THEN BEGIN
         product := product*x;
         i := i-1 END
      ELSE BEGIN
         product := product*square_x;
         i := i-2 END;

   x_to_nth_power := product
END; (* function *)

(* This function computes the series solution for e^x,
   0 <= x <= 1.                                         *)

FUNCTION series_value(x : REAL) : REAL;
VAR
   sum, oldsum,
   nexterm, s : REAL;
   n : INTEGER;


Figure 1.14 Pascal version of improved program.
Figure 1.14 (continued)


BEGIN
   n := 0;  sum := 1.0;  nexterm := 1.0;
   REPEAT
      n := n+1;  oldsum := sum;
      nexterm := nexterm*x/n;
      sum := sum + nexterm
   UNTIL sum = oldsum;

   number_of_iterations := n;
   series_value := sum
END;

BEGIN (* MAIN *)

   WRITELN; WRITELN;
   WRITE(' ENTER A GIVEN VALUE: '); READ(x);
   WRITELN; WRITELN; WRITELN;

   (* Break x into its integer and fractional parts,
      then invoke functions for computing powers of e
      for each part.                                   *)

   fractional_part_of_x := ABS(x - TRUNC(x));
   integer_part_of_x := ABS( TRUNC(x) );

   IF x <> 0.0 THEN
      save := x_to_nth_power(exp1, integer_part_of_x) *
              series_value(fractional_part_of_x)
   ELSE
      f := 1.0;

   IF x > 0.0 THEN f := save;
   IF x < 0.0 THEN f := 1.0/save;

   WRITELN('X':3,'SERIES VALUE':24,'EXACT VALUE':20,
           'ITERATIONS':16); WRITELN;

   WRITELN(x:6:2,' ',f:19:5,' ',
           EXP(x):19:5,number_of_iterations:13)

END.


3. Compare the methods of (2) for efficiency, rate of convergence, certainty of finding
a root, and reliability of estimates of accuracy of the solution.

4. Rearrange a function into an x = g(x) form, test to show that the rearrangement will
converge, and use it to find a root of f(x) = 0.

5. Explain how the Aitken acceleration can be used to minimize the number of iterations
required to attain a desired degree of accuracy.

6. Outline the demonstration that shows when Newton's method is usually quadratically
convergent and when it is only linearly convergent; tell what this means in terms
of increase of precision per step.

7. Use Newton's and Bairstow's methods to find all the roots of a polynomial.

8. Show that nested multiplication is equivalent to synthetic division and tell why this
method of evaluating a polynomial is advantageous.


9. Describe in general terms the QD method and Graeffe’s method. 

10. Distinguish between five types of errors and explain how each can be minimized. 

11. Explain how floating-point numbers are stored in computers and tell what factors 
affect their accuracy and range. 

12. Discuss the relative advantages of using absolute error versus relative error as 
measures of accuracy. 


13. Use the programs and subroutines of Section 1.15 to solve the chapter exercises, 
writing driver programs as required. 

14. Demonstrate that you can write a successful computer program that implements one 
or more of the algorithms of this chapter without referring to any of the example 
programs. 

15. Critique the programs of Section 1.15 and point out how they might be improved. 
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COMPUTER PROGRAMS 


Numerical procedures for obtaining the roots of an equation are often extremely slow 
and tedious when done by hand, although calculators ease the task. In some cases, both 
of these techniques are so tedious as to make the task impossible. In contrast, computers 
are a useful and powerful tool used by mathematicians, scientists, and engineers and are 
ideally adapted to implementing the algorithms presented. The following programs will 
implement the methods presented in Chapter 1. As an alternative, you may wish to use 
canned programs such as those in IMSL (Appendix C); there is a large variety to choose 
from. Among them are ZBRENT, which uses linear interpolation and bisection; ZRPOLY, 
which uses Muller’s method to find the roots of a polynomial with real coefficients; and 
ZSYSTM, which uses Newton’s method. 

We illustrate the program-development process with two examples, the first one 
applied to the ladder-in-the-mine problem that served to introduce this chapter. Recall 
that that problem required us to solve a transcendental equation for C: 
$$\frac{w_2 \cos(\pi - A - C)}{\sin^2(\pi - A - C)} - \frac{w_1 \cos C}{\sin^2 C} = 0.$$

In this equation $w_1$, $w_2$, and $A$ are supplied as constants. After finding the angle $C$, we
get the maximum length from

$$\ell_{\max} = \frac{w_2}{\sin(\pi - A - C)} + \frac{w_1}{\sin C}.$$


For our first example, we will find a root, 0° < C < (180° − A), using the modified
linear interpolation algorithm. Our problem has $w_1 = 7$, $w_2 = 9$, and $A = 123°$. The
algorithm was given in Section 1.3; we repeat it here but add a second termination
criterion.


Modified Linear Interpolation Method


To determine a root of f(x) = 0, given values of x1 and x2 such that f(x1) and
f(x2) are of opposite sign,


Set SAVE = f(x1); set F1 = f(x1); set F2 = f(x2).
REPEAT
   Set x3 = x2 - F2 * (x2 - x1)/(F2 - F1).
   IF f(x3) of opposite sign to F1:
      Set x2 = x3.
      Set F2 = f(x3).
      IF f(x3) of same sign as SAVE:
         Set F1 = F1/2.
      ENDIF.
   ELSE
      Set x1 = x3.
      Set F1 = f(x3).
      IF f(x3) of same sign as SAVE:
         Set F2 = F2/2.
      ENDIF.
   ENDIF.
   Set SAVE = f(x3).
UNTIL |x1 - x2| < XTOL
   OR |f(x3)| < FTOL.
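
As an illustration, the boxed algorithm can be transcribed almost line for line into Python and applied to the ladder equation. This is a sketch, not the book's FORTRAN program: the function names are chosen here, the iteration limit `nlim` is an added safeguard, and the constants are the ones quoted in the text ($w_1 = 7$, $w_2 = 9$, $A = 123°$, starting values 5° and 118°).

```python
import math

def modified_linear_interpolation(f, x1, x2, xtol, ftol, nlim=50):
    """Return (root, iterations) for a bracket [x1, x2] on which f changes sign."""
    f1, f2 = f(x1), f(x2)
    if f1 * f2 > 0.0:
        raise ValueError("f(x1) and f(x2) must be of opposite sign")
    save = f1
    for iteration in range(1, nlim + 1):
        x3 = x2 - f2 * (x2 - x1) / (f2 - f1)
        f3 = f(x3)
        if f3 * f1 < 0.0:            # f(x3) opposite in sign to F1
            x2, f2 = x3, f3
            if f3 * save > 0.0:      # same side twice in a row: halve F1
                f1 /= 2.0
        else:
            x1, f1 = x3, f3
            if f3 * save > 0.0:      # same side twice in a row: halve F2
                f2 /= 2.0
        save = f3
        if abs(x1 - x2) < xtol or abs(f3) < ftol:
            return x3, iteration
    raise RuntimeError("iteration limit reached before a tolerance was met")

W1, W2, A = 7.0, 9.0, 123.0          # constants of the ladder problem (degrees)

def ladder(c):
    """Left-hand side of the ladder equation; c is the angle C in degrees."""
    ang1 = math.radians(180.0 - A - c)
    ang2 = math.radians(c)
    return (W2 * math.cos(ang1) / math.sin(ang1) ** 2
            - W1 * math.cos(ang2) / math.sin(ang2) ** 2)

def ladder_length(c):
    """Maximum ladder length for the angle c (degrees)."""
    return (W2 / math.sin(math.radians(180.0 - A - c))
            + W1 / math.sin(math.radians(c)))

root, iterations = modified_linear_interpolation(ladder, 5.0, 118.0, 0.01, 0.0001)
print(f"C = {root:.3f} degrees, maximum length = {ladder_length(root):.2f} ft")
```

The halving of the retained function value is what keeps one end of the bracket from stagnating, which is the weakness of plain linear interpolation.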
Before we can write our program, we must establish the design of our program
modules. We must also refine the algorithm so that decisions are unambiguous.

1. A main program will read in the starting values of $x_1$ and $x_2$ and call the subroutine
   MDLNIN to execute the algorithm. Reflection reveals that the function is continuous
   in C and has only one zero. Suitable values for $x_1$ and $x_2$ are 5° and
   118°; these certainly bracket the value that makes the function zero.

2. A function subprogram will compute values of f(x). It then is easy to find roots
   of a different function by replacing the function subprogram. To provide flexibility,
   we will pass the name of the function subprogram as a parameter to the
   MDLNIN subroutine. This requires us to declare the name of the function
   subprogram as EXTERNAL. We pass the constants of the problem, $w_1$, $w_2$, and
   $A$, through COMMON. The values of $x_1$ and $x_2$ are passed as parameters; the
   value of the root is returned in XR.

3. In the subroutine, it will be appropriate to test initially whether $f(x_1)$ and $f(x_2)$
   are indeed of opposite sign, printing a message if they are not.

4. The termination on $|x_1 - x_2|$ must be quantified. We will compare this value to
   an input parameter, XTOL, to provide for terminating the calculation. In some
   situations, it may be preferable to terminate when the relative error $|x_1 - x_2|/x_1$
   is less than some predefined tolerance.

5. A second termination criterion is whether the function value is extremely small;
   it is possible that this indicates an x-value very near a root. The subroutine
   compares values of $f(x_3)$ to the parameter FTOL, to stop when the function is
   “sufficiently small.”

6. It is always good practice to limit the maximum number of iterations. The
   parameter NLIM does this.

7. The utility of the subroutine is increased if the calling program can easily determine
   on which condition the subroutine terminated. The parameter I signals
   which of the various conditions was satisfied on termination. It is also desirable
   to sometimes print out the successive iterates, but at other times this is not
   needed. The parameter I is used (at the time the subroutine is called) to indicate
   whether or not to print out each iterate.
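
The termination tests in the design list above can be collected into one small helper. The sketch below is an assumption-laden illustration rather than part of the book's programs: the function name `stopping_status` and the `relative` switch are invented here, while the returned codes follow the convention the listings use for I (1 for the x tolerance, 2 for the function tolerance, -1 when the iteration limit is reached). A return value of 0 means "keep iterating".

```python
def stopping_status(x1, x2, f3, xtol, ftol, iteration, nlim, relative=False):
    """Classify the state of a bracketing iteration.

    Returns 1 if the bracket is narrow enough, 2 if |f| is small enough,
    -1 if the iteration limit has been reached, and 0 otherwise.
    """
    width = abs(x1 - x2)
    if relative and x1 != 0.0:
        width /= abs(x1)             # relative error instead of absolute error
    if width < xtol:
        return 1
    if abs(f3) < ftol:
        return 2
    if iteration >= nlim:
        return -1
    return 0

# A bracket of width 0.5 near x = 100 passes a 1% relative test
# but fails an absolute test with the same tolerance.
print(stopping_status(100.0, 100.5, 1.0, 0.01, 1e-4, 1, 20, relative=True))
print(stopping_status(100.0, 100.5, 1.0, 0.01, 1e-4, 1, 20))
```

Returning a status code instead of printing inside the helper lets the caller decide what to report, which is the same reason the subroutine hands the parameter I back to the main program.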


In our programs, we try to provide adequate documentation by comments. Some 
documentation systems also require a flowchart;* we provide one in Fig. 1.15. 
Program 1 (see Fig. 1.16) was run with input data as follows: 


X1   = 5.0
X2   = 118.0
XTOL = 0.01
FTOL = 0.0001
NLIM = 20


*The relative advantages of flowcharts as compared to the statement of the algorithm in a structured form are 
well illustrated by this example. The flowchart is much more specific as regards details of the program but is 
also much more dependent on the implementation language. The structured statement is much more general 
and presents the essentials of the procedure more clearly. 


Figure 1.15 [Flowchart, not reproduced. Main program: START; set values for W1, W2, A;
read values X1, X2, XTOL, FTOL, NLIM; calculate max ladder length; print LMAX, the ladder
length. Function FCN: compute F(x), the required function. Subroutine MDLNIN: ENTER;
test whether F1 and F2 have the same sign and, if so, print a message; set SAVE = F1;
perform the iteration, with XR as the new value; write the last values.]
      PROGRAM LADDER (INPUT, OUTPUT)
C
      REAL LMAX,W1,W2,A,PI,X1,X2,XR,XTOL,FTOL,FCN
      INTEGER I,NLIM
      EXTERNAL FCN
      COMMON W1,W2,A,PI
C
      DATA I/0/
      W1 = 7.0
      W2 = 9.0
      A = 123.0
      PI = 4.0 * ATAN(1.0)
C
C  READ IN INITIAL GUESSES AND TOLERANCES
C
      READ*, X1,X2,XTOL,FTOL,NLIM
      CALL MDLNIN (FCN,X1,X2,XR,XTOL,FTOL,NLIM,I)
C
      IF ( I .GT. 0 ) THEN
C
C  THE ROOT IS RETURNED IN XR AND USED TO CALCULATE LMAX
C
         LMAX = W2 / SIN((180.0-A-XR)/180.0*PI) + W1 / SIN(XR/180.0*PI)
C
C  PRINT MAXIMUM LENGTH OF THE LADDER
C
         PRINT 200, LMAX
      END IF
C
  200 FORMAT(/' MAXIMUM LENGTH OF THE LADDER IN FEET IS ',F6.2)
C
      STOP
      END


      REAL FUNCTION FCN(X)
      REAL X,W1,W2,A,PI
      REAL ANG1,ANG2
      COMMON W1,W2,A,PI
C
C  COMPUTE FUNCTION VALUE
C
      ANG1 = (180.0-A-X)/180.0*PI
      ANG2 = X/180.0*PI
      FCN = W2 * COS(ANG1)/SIN(ANG1)**2 - W1 * COS(ANG2)/SIN(ANG2)**2
      RETURN
      END


      SUBROUTINE MDLNIN (FCN,X1,X2,XR,XTOL,FTOL,NLIM,I)
C
C  SUBROUTINE MDLNIN :
C     THIS SUBROUTINE FINDS A ROOT BY MODIFIED
C     LINEAR INTERPOLATION.
C
C  PARAMETERS ARE :
C
C  FCN       -FUNCTION THAT COMPUTES VALUES FOR F(X). MUST BE DECLARED
C             EXTERNAL IN CALLING PROGRAM. IT HAS ONE ARGUMENT, X.
C  X1,X2     -INITIAL VALUES OF X. F(X) MUST CHANGE SIGNS AT THESE PTS.
C  XR        -RETURNS THE ROOT TO THE MAIN PROGRAM.
C  XTOL,FTOL -TOLERANCE VALUES FOR X, F(X) TO TERMINATE ITERATIONS.
C  NLIM      -LIMIT TO NUMBER OF ITERATIONS.
C  I         -A SIGNAL FOR HOW ROUTINE TERMINATED.
C               I = 1    MEETS TOLERANCE FOR X VALUES.
C               I = 2    MEETS TOLERANCE FOR F(X).
C               I = -1   NLIM EXCEEDED.
C               I = -2   F(X1) NOT OPPOSITE IN SIGN TO F(X2).
C
C  WHEN THE SUBROUTINE IS CALLED, THE VALUE OF I INDICATES WHETHER TO
C  PRINT EACH VALUE OR NOT. I=0 MEANS PRINT THEM, .NE. 0 MEANS DON'T.
C
      REAL FCN,X1,X2,XR,XTOL,FTOL
      INTEGER NLIM,I,J
      REAL F1,F2,FR,XERR,FSAVE
C
C  CHECK THAT F(X1) & F(X2) DIFFER IN SIGN
C
      F1 = FCN(X1)
      F2 = FCN(X2)
      IF ( F1*F2 .GT. 0.0 ) THEN
         I = -2
         PRINT 201
         RETURN
      END IF
      FSAVE = F1
C
C  COMPUTE SEQUENCE OF POINTS CONVERGING TO THE ROOT
C
      DO 20 J=1,NLIM
         XR = X2 - F2*(X2-X1)/(F2-F1)
         FR = FCN(XR)
         XERR = ABS(X2-X1)/2.0
         IF ( I .EQ. 0 ) THEN
            PRINT 199, J,XR,FR
         END IF
C
C  CHECK ON STOPPING CRITERIA
C
         IF ( XERR .LE. XTOL ) THEN
            I = 1
            PRINT 202, J,XR,FR
            RETURN
         END IF
         IF ( ABS(FR) .LE. FTOL ) THEN
            I = 2
            PRINT 203, J,XR,FR
            RETURN
         END IF
C
C  FIND NEW POINT
C
         IF ( FR*F1 .GE. 0.0 ) THEN
            X1 = XR
            F1 = FR
            IF ( FR*FSAVE .GT. 0.0 ) F2 = F2/2.0
            FSAVE = FR
         ELSE
            X2 = XR
            F2 = FR
            IF ( FR*FSAVE .GT. 0.0 ) F1 = F1/2.0
            FSAVE = FR
         END IF
   20 CONTINUE
C
C  WHEN LOOP IS NORMALLY COMPLETED, NLIM IS EXCEEDED.
C
      I = -1
      PRINT 200, NLIM,XR,FR
      RETURN
C
  199 FORMAT(' AT ITERATION',I3,3X,' X = ',E10.5,3X,' F(X) = ',E12.5)
  200 FORMAT(/' TOLERANCE NOT MET AFTER ',I4,' ITERATIONS. X = ',E12.5,
     +       ' AND F(X) = ',E12.5)
  201 FORMAT(/' FUNCTION HAS SAME SIGN AT INITIAL X1 & X2')
  202 FORMAT(/' X TOLERANCE MET IN ',I4,' ITERATIONS. X = ',E12.5,
     +       ' AND F(X) = ',E12.5)
  203 FORMAT(/' F TOLERANCE MET IN ',I4,' ITERATIONS. X = ',E12.5,
     +       ' AND F(X) = ',E12.5)
      END

Figure 1.16 Program 1.

                    OUTPUT FOR PROGRAM 1

 AT ITERATION  1    X = .11678E+03    F(X) =   .10024E+02
 AT ITERATION  2    X = .11556E+03    F(X) =   .10160E+02
 AT ITERATION  3    X = .11314E+03    F(X) =   .10524E+02
 AT ITERATION  4    X = .10836E+03    F(X) =   .11660E+02
 AT ITERATION  5    X = .98739E+02    F(X) =   .16241E+02
 AT ITERATION  6    X = .77902E+02    F(X) =   .64523E+02
 AT ITERATION  7    X = .27286E+02    F(X) =   .22130E+01
 AT ITERATION  8    X = .24282E+02    F(X) =  -.11813E+02
 AT ITERATION  9    X = .26812E+02    F(X) =   .61146E-01
 AT ITERATION 10    X = .26799E+02    F(X) =   .18591E-02
 AT ITERATION 11    X = .26798E+02    F(X) =  -.17452E-02
 AT ITERATION 12    X = .26799E+02    F(X) =   .42591E-08

 F TOLERANCE MET IN   12 ITERATIONS. X =   .26799E+02 AND F(X) =   .42591E-08

 MAXIMUM LENGTH OF THE LADDER IN FEET IS  33.42

The output is shown with the program. We conclude that quite long ladders can negotiate
this turn in the mine shaft; the calculated length is over 33 ft.

For our second example of program development, we exhibit a program to solve for
the roots of a polynomial up to the 10th degree, using Bairstow’s method to get the
quadratic factors. The algorithm, which is an adaptation of the equations of Section 1.9,
can be written as shown below. Subscripts are advanced by two, so the equations for all
the b’s and c’s become identical when $b_1 = b_2 = c_1 = c_2 = 0$. (See Program 2, Fig.
1.17.)


Bairstow’s Method


To determine a quadratic factor, $x^2 - rx - s$, of the nth-degree polynomial
$a_3 x^n + a_4 x^{n-1} + \cdots + a_{n+2} x + a_{n+3}$, choose initial coefficients R and S
of the quadratic factor, then


Set B(1) = 0, B(2) = 0, C(1) = 0, C(2) = 0, DELR = 1000, DELS = 1000.
DO WHILE DELR >= tolerance value, or
         DELS >= tolerance value,
   DO FOR J = 3 to n + 3 step 1,
      Set B(J) = A(J) + R * B(J - 1) + S * B(J - 2).
      Set C(J) = B(J) + R * C(J - 1) + S * C(J - 2).
   ENDDO.
   Set DENOM = C(N + 1) * C(N + 1) - C(N + 2) * C(N).
   IF DENOM = 0:
      Set R = R + 1.
      Set S = S + 1.
      Repeat from beginning.
   ENDIF.
   Set DELR = [-B(N + 2) * C(N + 1) + B(N + 3) * C(N)]/DENOM.
   Set DELS = [-C(N + 1) * B(N + 3) + C(N + 2) * B(N + 2)]/DENOM.
   Set R = R + DELR.
   Set S = S + DELS.
ENDWHILE.
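
The boxed procedure can be sketched in Python as follows. The sketch makes several assumptions that are not part of the book's program: lists are indexed from zero, the helper names are invented here, the stopping test uses |DELR| + |DELS| as the FORTRAN listing does, the limit of 50 passes is arbitrary, and the deflated polynomial is recomputed with the final R and S. Each list of b's and c's keeps two leading zeros, so B(J) of the box corresponds to `b[J - 1]`.

```python
def divide_by_quadratic(coeffs, r, s):
    """Synthetic division by x^2 - r*x - s.

    Returns [0, 0, b_1, ..., b_(n+1)] for coefficients given highest power first.
    """
    b = [0.0, 0.0]
    for a in coeffs:
        b.append(a + r * b[-1] + s * b[-2])
    return b

def bairstow_factor(coeffs, r=0.0, s=0.0, tol=1e-4, max_iter=50):
    """Return (r, s, deflated coefficients) for one factor x^2 - r*x - s."""
    n = len(coeffs) - 1
    for _ in range(max_iter):
        b = divide_by_quadratic(coeffs, r, s)
        c = divide_by_quadratic(b[2:], r, s)     # same recurrence applied to the b's
        denom = c[n] * c[n] - c[n + 1] * c[n - 1]
        if denom == 0.0:                         # shift the trial values and retry
            r, s = r + 1.0, s + 1.0
            continue
        delr = (-b[n + 1] * c[n] + b[n + 2] * c[n - 1]) / denom
        dels = (-c[n] * b[n + 2] + c[n + 1] * b[n + 1]) / denom
        r, s = r + delr, s + dels
        if abs(delr) + abs(dels) <= tol:
            quotient = divide_by_quadratic(coeffs, r, s)[2:n + 1]
            return r, s, quotient
    raise RuntimeError("no quadratic factor found within max_iter passes")

poly = [1.0, -17.8, 99.41, -261.218, 352.611, -134.106]
r1, s1, cubic = bairstow_factor(poly)            # start from R = S = 0
r2, s2, linear = bairstow_factor(cubic, r1, s1)  # continue from the last R and S
print(f"x^2 - {r1:.5f} x - {s1:.5f}")
print(f"x^2 - {r2:.5f} x - {s2:.5f}")
print(f"{linear[0]:.5f} x + {linear[1]:.5f}")
```

Applying the same recurrence to the b's to obtain the c's is the reason the box advances the subscripts by two: with two leading zeros, every B(J) and C(J) is computed by the identical formula.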


Because the program handles only one specific type of equation, there is no advantage 
to subdividing it into subroutines. The number of iterations needed to determine the 
quadratic factor within a given tolerance is arbitrarily limited to 20. In case the denominator 
during the step where we solve for new R and S values is zero, the trial values of R and 
S are arbitrarily increased by unity and the search for a quadratic factor is begun again— 
this will ordinarily avoid the attempted division by zero. When a quadratic factor is found 
successfully, the deflated polynomial (the b’s are its coefficients) is used to find additional 
factors. (The algorithm applies only to obtaining each quadratic factor.) The program 
could be used for polynomials of degree higher than 10 by changing the DIMENSION 
statement, but this is not advisable because accumulated errors will often be too large. 
      PROGRAM BRSTOW (INPUT, OUTPUT)
C
C  THIS PROGRAM USES BAIRSTOW'S METHOD OF EXTRACTING THE
C  QUADRATIC FACTORS OF A POLYNOMIAL.
C
C  READ IN THE DEGREE OF THE ORIGINAL POLYNOMIAL AND THE
C  ESTIMATES R & S FOR THE QUADRATIC FACTOR. (IF ESTIMATES NOT AVAILABLE,
C  R & S ARE TAKEN AS ZERO.) ALSO READ IN ERROR TOLERANCE.
C
      REAL A(13),B(13),C(13),R,S,DELR,DELS,TOL,DENOM
      INTEGER I,J,NP3,K
C
      READ *, N,R,S,TOL
C
C  READ IN COEFFICIENTS OF THE POLYNOMIAL
C
      NP3 = N + 3
      READ *, ( A(I), I=3,NP3 )
C
C  PRINT HEADING AND ECHO PRINT THE COEFFICIENTS
C
      PRINT 102
      DO 5 I=3,NP3
         J = NP3 - I
         PRINT 103, J,A(I)
    5 CONTINUE
      PRINT 104, TOL
C
C  COMPUTE B AND C ARRAYS. LIMIT NUMBER OF ITERATIONS TO 20.
C
      DATA B,C,K / 26*0.0,1 /
    8 IF ( K .LE. 20 ) THEN
         DO 10 J=3,NP3
            B(J) = A(J) + R*B(J-1) + S*B(J-2)
            C(J) = B(J) + R*C(J-1) + S*C(J-2)
   10    CONTINUE
C
C  COMPUTE DENOMINATOR AND CHECK IF ZERO. IF NOT ZERO, COMPUTE NEW R
C  AND S VALUES. CHECK IF ACCURACY IS OK.
C
         DENOM = C(N+1) * C(N+1) - C(N+2) * C(N)
         IF ( DENOM .NE. 0.0 ) THEN
            DELR = ( -B(N+2) * C(N+1) + B(N+3) * C(N) ) / DENOM
            DELS = ( -C(N+1) * B(N+3) + C(N+2) * B(N+2) ) / DENOM
            R = R + DELR
            S = S + DELS
            IF ( ABS(DELR) + ABS(DELS) .LE. TOL ) THEN
               PRINT 106, R,S
               N = N - 2
               IF ( N - 2 ) 22,23,24
C
C  REDUCED EQUATION IS OF DEGREE ONE - PRINT IT
C
   22          PRINT 107, B(N+2),B(N+3)
               STOP
C
C  REDUCED EQUATION IS OF DEGREE TWO - PRINT IT
C
   23          PRINT 108, B(N+1),B(N+2),B(N+3)
               STOP
C
C  DEGREE OF REDUCED EQUATION MORE THAN TWO. SET COEFFICIENTS INTO A
C  ARRAY AND GET NEXT FACTOR.
C
   24          NP3 = N+3
               DO 30 I=3,NP3
                  A(I) = B(I)
   30          CONTINUE
               K = 0
            END IF
         ELSE
            R = R + 1.0
            S = S + 1.0
         END IF
C
C  START OVER.
C
         K = K + 1
         GO TO 8
      ELSE
C
C  TOLERANCE NOT MET. PRINT FACTOR AND QUIT.
C
         PRINT 105, R,S
      END IF
      STOP
C
  102 FORMAT(//,' QUADRATIC FACTORS BY BAIRSTOW METHOD.' /
     +       ' ORIGINAL POLYNOMIAL IS - ',//' POWER OF X ',
     +       5X,' COEFFICIENT' /)
  103 FORMAT(I6,10X,F10.3)
  104 FORMAT(/' FACTORS ARE, WITH TOLERANCE OF ',E14.6)
  105 FORMAT(' TOLERANCE NOT MET IN 20 ITERATIONS. LAST FACTOR FOUND ',
     +       'WAS ',/' X**2 - ',F10.5,' X - ',F10.5)
  106 FORMAT(' X**2 - ',F8.5,' X - ',F8.5)
  107 FORMAT(1X,F10.5,' X + ',F10.5)
  108 FORMAT(1X,F10.5,' X**2 + ',F10.5,' X + ',F10.5)
      END

Figure 1.17 Program 2.

                    OUTPUT FOR PROGRAM 2

 QUADRATIC FACTORS BY BAIRSTOW METHOD.
 ORIGINAL POLYNOMIAL IS -

 POWER OF X      COEFFICIENT

     5             1.000
     4           -17.800
     3            99.410
     2          -261.218
     1           352.611
     0          -134.106

 FACTORS ARE, WITH TOLERANCE OF   .100000E-03
 X**2 -  4.20000 X - -2.10000
 X**2 -  3.30002 X - -6.20007
    1.00000 X +  -10.30000


When the program was tested with the polynomial

$$x^5 - 17.8x^4 + 99.41x^3 - 261.218x^2 + 352.611x - 134.106,$$

the results were as shown after the program. The exact factors are

$$x^2 - 4.2x + 2.1, \qquad x^2 - 3.3x + 6.2, \qquad \text{and} \qquad x - 10.3,$$

and the program finds them with excellent precision.
The input data were

N = 5 (degree)
R = 0.0
S = 0.0
TOL = 0.0001
Coefficients = 1.0, -17.8, 99.41, -261.218, 352.611, -134.106
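
A quick way to confirm that the three factors quoted above really reproduce the test polynomial is to multiply them out. The short Python sketch below does this with a plain coefficient-list product; the helper name `poly_mul` is chosen here for illustration.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists, highest power first."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

factors = ([1.0, -4.2, 2.1], [1.0, -3.3, 6.2], [1.0, -10.3])
product = [1.0]
for factor in factors:
    product = poly_mul(product, factor)

expected = [1.0, -17.8, 99.41, -261.218, 352.611, -134.106]
for computed, given in zip(product, expected):
    print(f"{computed:12.3f}{given:12.3f}")
```

The two printed columns agree to the three decimals shown, so the coefficients in the input data and the exact factors are consistent with each other.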


Programs 3 through 7 (Figs. 1.18 through 1.22) are FORTRAN implementations of 
other algorithms of this chapter: 


Program   Method                                  Subroutine Name
   3      Interval halving (bisection)            INTHV
   4      Linear interpolation (regula falsi)     LNINTP
   5      Newton’s method                         NEWTN
   6      Iteration with x = g(x)                 XGXIT
   7      Muller’s method                         MULLR


Each of the subroutines is tested with the function $f(x) = 3x + \sin(x) - e^x = 0$.
These subroutines might appropriately be placed in the FORTRAN library at your 
installation so that they may be called automatically by your programs. 
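
For readers who want to experiment before studying the FORTRAN listings, interval halving on the same test function can be sketched in Python. This is an assumption-based illustration, not a translation of INTHV: the function names are chosen here, and the stopping rule (half the current bracket width compared with XTOL, or |f| compared with FTOL) is one reasonable reading of the tolerances described in the subroutine comments. The starting interval and tolerances are those of the driver program.

```python
import math

def f(x):
    """Test function used for Programs 3 through 7."""
    return 3.0 * x + math.sin(x) - math.exp(x)

def bisect(func, x1, x2, xtol, ftol, nlim=50):
    """Return (root estimate, iterations) by interval halving."""
    f1 = func(x1)
    if f1 * func(x2) > 0.0:
        raise ValueError("function has the same sign at x1 and x2")
    for iteration in range(1, nlim + 1):
        xr = (x1 + x2) / 2.0
        fr = func(xr)
        if abs(x1 - x2) / 2.0 <= xtol or abs(fr) <= ftol:
            return xr, iteration
        if fr * f1 > 0.0:      # root lies in the upper half
            x1, f1 = xr, fr
        else:                  # root lies in the lower half
            x2 = xr
    raise RuntimeError("iteration limit reached before a tolerance was met")

root, iterations = bisect(f, 0.0, 1.0, 0.0001, 0.00001)
print(f"x = {root:.5f} after {iterations} iterations, f(x) = {f(root):.5e}")
```

Each pass halves the bracket, so the number of passes needed for a given XTOL can be predicted in advance from the width of the starting interval.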
Figure 1.18 Program 3.

      PROGRAM PINTHV (INPUT, OUTPUT)
C
C  DRIVER FOR INTERVAL HALVING SUBROUTINE, INTHV
C
      REAL FCN,X1,X2,XTOL,FTOL
      INTEGER I,NLIM
      EXTERNAL FCN
C
C  INITIALIZE VARIABLES FOR SUBROUTINE INTHV
C
      DATA X1,X2,XTOL,FTOL,I,NLIM/0.0,1.0,0.0001,0.00001,0,50/
C
      CALL INTHV (FCN,X1,X2,XR,XTOL,FTOL,NLIM,I)
C
C  THE ROOT HAS BEEN FOUND AT XR IF I EQUALS 1 OR 2
C
      STOP
      END
C
      REAL FUNCTION FCN(X)
      REAL X
      FCN = 3.0*X + SIN(X) - EXP(X)
      RETURN
      END
C
C  ---------------------------------------------------------------------
C
      SUBROUTINE INTHV (FCN,X1,X2,XR,XTOL,FTOL,NLIM,I)
C
C  SUBROUTINE INTHV :
C     THIS SUBROUTINE FINDS THE ROOT OF A FUNCTION,
C     F(X) = 0, BY INTERVAL HALVING.
C
C  PARAMETERS ARE :
C
C  FCN       -FUNCTION THAT COMPUTES VALUES FOR F(X). MUST BE DECLARED
C             EXTERNAL IN CALLING PROGRAM. IT HAS ONE ARGUMENT, X.
C  X1,X2     -INITIAL VALUES OF X. F(X) MUST CHANGE SIGN AT THESE PTS.
C  XR        -RETURNS THE ROOT TO THE MAIN PROGRAM.
C  XTOL,FTOL -TOLERANCE VALUES FOR X, F(X) TO TERMINATE ITERATIONS.
C  NLIM      -LIMIT TO NUMBER OF ITERATIONS.
C  I         -A SIGNAL FOR HOW ROUTINE TERMINATED ON OUTPUT.
C               I = 1    MEETS TOLERANCE FOR X VALUES.
C               I = 2    MEETS TOLERANCE FOR F(X) VALUES.
C               I = -1   NLIM EXCEEDED.
C               I = -2   F(X1) NOT OPPOSITE IN SIGN TO F(X2)
C            -PRINT CONTROL ON INPUT
C               I = 0    PRINT RESULTS AT EACH ITERATION
C
C  WHEN THE SUBROUTINE IS CALLED, THE VALUE OF I INDICATES WHETHER TO
C  PRINT EACH VALUE OR NOT. I=0 MEANS PRINT THEM, I.NE.0 MEANS DON'T.
C
      REAL FCN,X1,X2,XR,XTOL,FTOL
      INTEGER NLIM,I,J
      REAL F1,F2,FR,XERR
C
C  CHECK THAT F(X1) & F(X2) DIFFER IN SIGN
C
      F1 = FCN(X1)
      F2 = FCN(X2)
      IF ( F1*F2 .GT. 0.0 ) THEN
         I = -2
         PRINT 201
         RETURN
      END IF
C
C  COMPUTE SEQUENCE OF POINTS CONVERGING TO THE ROOT
C
      DO 20 J=1,NLIM
         XR = (X1 + X2) / 2.0
         FR = FCN(XR)
         XERR = ABS(X1 - X2) / 2.0
         IF ( I .EQ. 0 ) THEN
            PRINT 199, J,XR,FR
         END IF
C
C  CHECK ON STOPPING CRITERIA
C
         IF ( XERR .LE. XTOL ) THEN
            I = 1
            PRINT 202, J,XR,FR
            RETURN
         END IF
         IF ( ABS(FR) .LE. FTOL ) THEN
            I = 2
            PRINT 203, J,XR,FR
            RETURN
         END IF
C
C  COMPUTE NEW INTERVAL
C
         IF ( FR*F1 .GT. 0.0 ) THEN
            X1 = XR
            F1 = FR
         ELSE
            X2 = XR
            F2 = FR
         END IF
   20 CONTINUE
C
C  WHEN LOOP IS NORMALLY COMPLETED, NLIM IS EXCEEDED
C
      I = -1
      PRINT 200, NLIM,XR,FR
      RETURN
C
  199 FORMAT(' AT ITERATION',I3,3X,' X =',E12.5,4X,' F(X) =',E12.5)
  200 FORMAT(/' TOLERANCE NOT MET AFTER ',I4,' ITERATIONS X = ',
     +       E12.5,' F(X) = ',E12.5)
  201 FORMAT(/' FUNCTION HAS SAME SIGN AT X1 & X2 ')
  202 FORMAT(/' TOLERANCE MET IN ',I4,' ITERATIONS X = ',E12.5,
     +       ' F(X) = ',E12.5)
  203 FORMAT(/' F TOLERANCE MET IN ',I4,' ITERATIONS X = ',E12.5,
     +       ' F(X) = ',E12.5)
      END

                    OUTPUT FOR PROGRAM 3

 AT ITERATION  1    X =  .50000E+00     F(X) =  .33070E+00
 AT ITERATION  2    X =  .25000E+00     F(X) = -.28662E+00
 AT ITERATION  3    X =  .37500E+00     F(X) =  .36281E-01
 AT ITERATION  4    X =  .31250E+00     F(X) = -.12190E+00
 AT ITERATION  5    X =  .34375E+00     F(X) = -.41956E-01
 AT ITERATION  6    X =  .35938E+00     F(X) = -.26196E-02
 AT ITERATION  7    X =  .36719E+00     F(X) =  .16886E-01
 AT ITERATION  8    X =  .36328E+00     F(X) =  .71467E-02
 AT ITERATION  9    X =  .36133E+00     F(X) =  .22670E-02
 AT ITERATION 10    X =  .36035E+00     F(X) = -.17548E-03
 AT ITERATION 11    X =  .36084E+00     F(X) =  .10460E-02
 AT ITERATION 12    X =  .36060E+00     F(X) =  .43529E-03
 AT ITERATION 13    X =  .36047E+00     F(X) =  .12992E-03
 AT ITERATION 14    X =  .36041E+00     F(X) = -.22780E-04

 TOLERANCE MET IN   14 ITERATIONS X =  .36041E+00 F(X) = -.22780E-04


      PROGRAM PLNINT (INPUT, OUTPUT)
C
C  DRIVER FOR LINEAR INTERPOLATION (I.E. REGULA FALSI) SUBROUTINE,
C  LNINTP
C
      REAL FCN,X1,X2,XTOL,FTOL
      INTEGER I,NLIM
      EXTERNAL FCN
C
C  INITIALIZE VARIABLES FOR SUBROUTINE LNINTP
C
      DATA X1,X2,XTOL,FTOL,I,NLIM/0.0,1.0,0.0001,0.00001,0,50/
C
      CALL LNINTP (FCN,X1,X2,XR,XTOL,FTOL,NLIM,I)
C
C  THE ROOT HAS BEEN FOUND AT XR IF I EQUALS 1 OR 2
C
      STOP
      END

Figure 1.19 Program 4.


      REAL FUNCTION FCN(X)
      REAL X
      FCN = 3.0*X + SIN(X) - EXP(X)
      RETURN
      END
C
      SUBROUTINE LNINTP (FCN,X1,X2,XR,XTOL,FTOL,NLIM,I)
C
C  SUBROUTINE LNINTP :
C     THIS SUBROUTINE FINDS THE ROOT OF A FUNCTION,
C     F(X) = 0, BY LINEAR INTERPOLATION (I.E. REGULA FALSI).
C
C  PARAMETERS ARE :
C
C  FCN       -FUNCTION THAT COMPUTES VALUES FOR F(X). MUST BE DECLARED
C             EXTERNAL IN CALLING PROGRAM. IT HAS ONE ARGUMENT, X.
C  X1,X2     -INITIAL VALUES OF X. F(X) MUST CHANGE SIGN AT THESE PTS.
C  XR        -RETURNS THE ROOT TO THE MAIN PROGRAM.
C  XTOL,FTOL -TOLERANCE VALUES FOR X, F(X) TO TERMINATE ITERATIONS.
C  NLIM      -LIMIT TO NUMBER OF ITERATIONS.
C  I         -A SIGNAL FOR HOW ROUTINE TERMINATED.
C               I = 1    MEETS TOLERANCE FOR X VALUES.
C               I = 2    MEETS TOLERANCE FOR F(X)
C               I = -1   NLIM EXCEEDED.
C               I = -2   F(X1) NOT OPPOSITE IN SIGN TO F(X2)
C
C  WHEN THE SUBROUTINE IS CALLED, THE VALUE OF I INDICATES WHETHER TO
C  PRINT EACH VALUE OR NOT. I=0 MEANS PRINT THEM, .NE. 0 MEANS DON'T.
C
      REAL FCN,X1,X2,XR,XTOL,FTOL
      INTEGER NLIM,I,J
      REAL F1,F2,FR,XERR
C
C  CHECK THAT F(X1) & F(X2) DIFFER IN SIGN
C
      F1 = FCN(X1)
      F2 = FCN(X2)
      IF ( F1*F2 .GT. 0.0 ) THEN
         I = -2
         PRINT 201
         RETURN
      END IF
C
C  COMPUTE SEQUENCE OF POINTS CONVERGING TO THE ROOT
C
      DO 20 J=1,NLIM
         XR = X2 - F2*(X2-X1)/(F2-F1)
         FR = FCN(XR)
         XERR = ABS(X1 - X2) / 2.0
         IF ( I .EQ. 0 ) THEN
            PRINT 199, J,XR,FR

      PROGRAM PNEWTN (INPUT, OUTPUT)
C
C  DRIVER FOR NEWTON'S METHOD SUBROUTINE, NEWTN
C
      REAL FCN,FDER,X,XTOL,FTOL
      INTEGER I,NLIM
      EXTERNAL FCN,FDER
C
C  INITIALIZE VARIABLES FOR SUBROUTINE NEWTN
C
      DATA X,XTOL,FTOL,I,NLIM/0.5,0.0001,0.00001,0,50/
C
      CALL NEWTN (FCN,FDER,X,XTOL,FTOL,NLIM,I)
C
C  THE ROOT HAS BEEN FOUND AT X IF I EQUALS 1 OR 2
C
      STOP
      END
C
      REAL FUNCTION FCN(X)
      REAL X, SIN, EXP
      FCN = 3.0*X + SIN(X) - EXP(X)
      RETURN
      END
C
      REAL FUNCTION FDER(X)
      REAL X, COS, EXP
      FDER = 3.0 + COS(X) - EXP(X)
      RETURN
      END
C
      SUBROUTINE NEWTN (FCN,FDER,X,XTOL,FTOL,NLIM,I)
C
C  SUBROUTINE NEWTN :
C     THIS SUBROUTINE FINDS THE ROOT OF A FUNCTION,
C     F(X) = 0, BY NEWTON'S METHOD.
C
C  PARAMETERS ARE :
C
C  FCN       -FUNCTION THAT COMPUTES VALUES FOR F(X). MUST BE DECLARED
C             EXTERNAL IN CALLING PROGRAM. IT HAS ONE ARGUMENT, X.
C  FDER      -FUNCTION THAT COMPUTES THE DERIVATIVE OF F(X). MUST ALSO BE
C             DECLARED EXTERNAL.
C  X         -INITIAL VALUE OF X. SHOULD BE NEAR ROOT. ALSO RETURNS THE VALUE
C             OF THE ROOT TO THE CALLER.
C  XTOL,FTOL -TOLERANCE VALUES FOR X, F(X) TO TERMINATE ITERATIONS.
C  NLIM      -LIMIT TO NUMBER OF ITERATIONS.
C  I         -A SIGNAL FOR HOW ROUTINE TERMINATED.
C               I = 1    MEETS TOLERANCE FOR X VALUES.
C               I = 2    MEETS TOLERANCE FOR F(X).
C               I = -1   NLIM EXCEEDED.
C
C  WHEN THE SUBROUTINE IS CALLED, THE VALUE OF I INDICATES WHETHER TO
C  PRINT EACH VALUE OR NOT. I=0 MEANS PRINT THEM, .NE. 0 MEANS DON'T.
C
      REAL FCN,FDER,X,XTOL,FTOL
      INTEGER NLIM,I,J
      REAL FX,DELX
C
C  COMPUTE SEQUENCE OF POINTS CONVERGING TO THE ROOT
C
      FX = FCN(X)
      DO 20 J=1,NLIM
         DELX = FX / FDER(X)
         X = X - DELX
         FX = FCN(X)
         IF ( I .EQ. 0 ) THEN
            PRINT 199, J,X,FX
         END IF
C
C  CHECK ON STOPPING CRITERIA
C
         IF ( ABS(DELX) .LE. XTOL ) THEN
            I = 1
            PRINT 202, J,X,FX
            RETURN
         END IF
         IF ( ABS(FX) .LE. FTOL ) THEN
            I = 2
            PRINT 203, J,X,FX
            RETURN
         END IF
   20 CONTINUE
C
C  WHEN LOOP IS NORMALLY COMPLETED, NLIM IS EXCEEDED
C
      I = -1
      PRINT 200, NLIM,X,FX
      RETURN
C
  199 FORMAT(' AT ITERATION',I3,3X,' X =',E12.5,3X,' F(X) =',E12.5)
  200 FORMAT(/' TOLERANCE NOT MET AFTER ',I4,' ITERATIONS X = ',
     +       E12.5,' F(X) = ',E12.5)
  202 FORMAT(/' TOLERANCE MET IN',I3,' ITERATIONS: X =',E12.5,
     +       3X,' F(X) =',E12.5)
  203 FORMAT(/' F TOLERANCE MET IN',I3,' ITERATIONS X =',E12.5,
     +       3X,' F(X) =',E12.5)
      END

Figure 1.20 Program 5.

                    OUTPUT FOR PROGRAM 5

 AT ITERATION  1    X =  .35163E+00    F(X) = -.22073E-01
 AT ITERATION  2    X =  .36039E+00    F(X) = -.68143E-04
 AT ITERATION  3    X =  .36042E+00    F(X) = -.66267E-09

 TOLERANCE MET IN  3 ITERATIONS: X =  .36042E+00    F(X) = -.66267E-09
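
The loop of subroutine NEWTN is short enough to restate in Python for experimentation. The sketch below is an illustration under assumptions, not a replacement for the FORTRAN routine: the function names are chosen here, the two tolerance tests are combined into one return, and the starting value and tolerances are those of the driver program.

```python
import math

def f(x):
    return 3.0 * x + math.sin(x) - math.exp(x)

def fder(x):
    """Derivative of f."""
    return 3.0 + math.cos(x) - math.exp(x)

def newton(func, deriv, x, xtol, ftol, nlim=50):
    """Return (root estimate, iterations) by Newton's method."""
    fx = func(x)
    for iteration in range(1, nlim + 1):
        delx = fx / deriv(x)
        x -= delx
        fx = func(x)
        if abs(delx) <= xtol or abs(fx) <= ftol:
            return x, iteration
    raise RuntimeError("iteration limit reached before a tolerance was met")

root, iterations = newton(f, fder, 0.5, 0.0001, 0.00001)
print(f"x = {root:.5f} after {iterations} iterations, f(x) = {f(root):.3e}")
```

Note that the sketch, like the subroutine, does not guard against a zero derivative; a production routine would test `deriv(x)` before dividing.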


      PROGRAM PXGXIT (INPUT, OUTPUT)
C
C  THIS DRIVER CALLS SUBROUTINE XGXIT
C
      REAL GFCN,X
      INTEGER I
      EXTERNAL GFCN
      DATA X,I / 0.6, 0 /
      CALL XGXIT (GFCN,X,0.0001,50,I)
C
C  IF I = 1, THE ROOT IS EQUAL TO X.
C
      STOP
      END
C
      REAL FUNCTION GFCN(X)
      REAL X
      GFCN = ( EXP(X) - SIN(X) ) / 3.0
      RETURN
      END
C
      SUBROUTINE XGXIT (GFCN,X,XTOL,NLIM,I)
C
C  SUBROUTINE XGXIT :
C     THIS SUBROUTINE FINDS THE ROOT OF F(X) = 0
C     BY ITERATION WITH X = G(X).
C
C  PARAMETERS ARE :
C
C  GFCN  -FUNCTION TO COMPUTE G(X). MUST BE DECLARED EXTERNAL IN
C         CALLING PROGRAM. IT HAS ONE ARGUMENT, X.
C  X     -INITIAL VALUE TO BEGIN ITERATIONS. X ALSO RETURNS THE
C         VALUE OF THE ROOT TO THE CALLER.
C  XTOL  -TOLERANCE VALUE FOR CHANGE IN X TO TERMINATE ITERATIONS.
C  NLIM  -LIMIT TO NUMBER OF ITERATIONS.
C  I     -A SIGNAL FOR HOW TERMINATED
C           I = 1    MEETS TOLERANCE FOR X VALUES.
C           I = -1   NLIM EXCEEDED.
C           I = -2   ITERATIONS APPEAR TO DIVERGE IN INITIAL CALCULATIONS.
C
C  WHEN THE SUBROUTINE IS CALLED, THE VALUE OF I INDICATES WHETHER TO
C  PRINT EACH VALUE OR NOT. I=0 MEANS PRINT, .NE. 0 MEANS DON'T.
C
      REAL GFCN,X,XTOL
      INTEGER NLIM,I,J
      REAL SAVEX,DEL1,DEL2
C
C  CHECK INITIAL VALUES
C
      J = 1
      SAVEX = X
      X = GFCN(X)
      DEL1 = ABS(X - SAVEX)
      IF ( DEL1 .LE. XTOL ) THEN
         I = 1
         PRINT 202, J,SAVEX,X
         RETURN
      END IF
      IF ( I .EQ. 0 ) THEN
         PRINT 199, J,SAVEX,X
      END IF
C
C  GENERATE SEQUENCE OF POINTS
C
      DO 20 J=2,NLIM
         SAVEX = X
         X = GFCN(X)
         DEL2 = ABS(X - SAVEX)
C
C  CHECK FOR XTOL CONDITION
C
         IF ( DEL2 .LE. XTOL ) THEN
            I = 1
            PRINT 202, J,SAVEX,X
            RETURN
         END IF
C
C  CHECK FOR DIVERGENCE AT START
C
         IF ( J .EQ. 2 ) THEN
            IF ( DEL1 .LE. DEL2 ) THEN
               I = -2
               PRINT 201
               RETURN
            END IF
         END IF
         IF ( I .EQ. 0 ) PRINT 199, J,SAVEX,X
   20 CONTINUE
C
C  WHEN LOOP IS NORMALLY TERMINATED, NLIM IS EXCEEDED.
C
      I = -1
      PRINT 200, NLIM,SAVEX,X
      RETURN
C
  199 FORMAT(' AT ITERATION',I3,4X,' X =',F9.4,4X,'G(X) =',F9.4)
  200 FORMAT(/' METHOD DID NOT CONVERGE IN ',I4,' ITERATIONS. FINAL ',
     +       'VALUES ARE X = ',F12.5,' G(X) = ',F12.5)
  201 FORMAT(/' FIRST THREE VALUES INDICATE DIVERGENCE.' )
  202 FORMAT(/' XTOL MET IN ',I4,' ITERATIONS. X = ',F12.5,
     +       ' G(X) = ',F12.5)
      END

Figure 1.21 Program 6.

                    OUTPUT FOR PROGRAM 6

 AT ITERATION  1     X =    .6000    G(X) =    .4192
 AT ITERATION  2     X =    .4192    G(X) =    .3712
 AT ITERATION  3     X =    .3712    G(X) =    .3623
 AT ITERATION  4     X =    .3623    G(X) =    .3607
 AT ITERATION  5     X =    .3607    G(X) =    .3605

 XTOL MET IN    6 ITERATIONS. X =      .36047 G(X) =      .36043

      PROGRAM PMULLR (INPUT, OUTPUT)
C
C  THIS DRIVER CALLS SUBROUTINE MULLR
C
      REAL FCN,XR,H,XTOL,FTOL
      INTEGER I,NLIM
      EXTERNAL FCN
      DATA XR,H,XTOL,FTOL,NLIM,I/0.5,0.2,0.0001,0.00001,50,0/
      CALL MULLR (FCN,XR,H,XTOL,FTOL,NLIM,I)
C
C  IF I = 1 OR 2, THE ROOT IS EQUAL TO XR
C
      STOP
      END
C
      REAL FUNCTION FCN(X)
      REAL X
      FCN = 3*X + SIN(X) - EXP(X)
      RETURN
      END
C
      SUBROUTINE MULLR (FCN,XR,H,XTOL,FTOL,NLIM,I)
C
C  SUBROUTINE MULLR :
C     THIS SUBROUTINE FINDS THE ROOT OF F(X) = 0 BY
C     QUADRATIC INTERPOLATION ON THREE POINTS - MULLER'S METHOD.
C
C  FCN       -FUNCTION THAT COMPUTES VALUES FOR F(X). MUST BE DECLARED
C             EXTERNAL IN CALLING PROGRAM. IT HAS ONE ARGUMENT, X.
C  XR        -INITIAL APPROXIMATION TO THE ROOT. USED TO BEGIN
C             ITERATIONS. ALSO RETURNS THE VALUE OF THE ROOT.
C  H         -DISPLACEMENT FROM X USED TO BEGIN CALCULATIONS. THE FIRST
C             QUADRATIC IS FITTED AT F(X), F(X+H), F(X-H).
C  XTOL,FTOL -TOLERANCE VALUES FOR X, F(X) TO TERMINATE ITERATIONS.
C  I         -A SIGNAL FOR HOW ROUTINE TERMINATED.
C               I = 1    MEETS TOLERANCE FOR X VALUES.
C               I = 2    MEETS TOLERANCE FOR F(X).
C               I = -1   NLIM EXCEEDED.
C
C  WHEN THE SUBROUTINE IS CALLED, THE VALUE OF I INDICATES WHETHER TO
C  PRINT EACH VALUE OR NOT. I=0 MEANS PRINT THEM, I.NE.0 MEANS DON'T.
C
      REAL FCN,XR,H,XTOL,FTOL
      INTEGER NLIM,I,J
      REAL Y(3),F1,F2,F3,H1,H2,G,A,B,C,DISC,FR,DELX
C
      Y(1) = XR - H
      Y(2) = XR
      Y(3) = XR + H
      F1 = FCN( Y(1) )
      F2 = FCN( Y(2) )
      F3 = FCN( Y(3) )
C
C  BEGIN ITERATIONS
C
      DO 20 J=1,NLIM
         H1 = Y(2) - Y(1)
         H2 = Y(3) - Y(2)
         G = H1 / H2
         A = ( F3*G - F2*(1.0 + G) + F1 ) / ( H1*(H1 + H2) )
         B = ( F3 - F2 - A*H2*H2 ) / H2
         C = F2
         DISC = SQRT( B*B - 4.0*A*C )
         IF ( B .LT. 0.0 ) DISC = -DISC
C
C  FIND ROOT OF QUADRATIC : A*V**2 + B*V + C = 0
C
         DELX = -2.0 * C / ( B + DISC )
C
C  UPDATE XR
C
         XR = Y(2) + DELX
         FR = FCN(XR)
         IF ( I .EQ. 0 ) PRINT 199, J,XR,FR
C
C  CHECK STOPPING CRITERIA
C
         IF ( ABS(DELX) .LE. XTOL ) THEN
            I = 1
            PRINT 202, J,XR,FR
            RETURN
         END IF
         IF ( ABS(FR) .LE. FTOL ) THEN
            I = 2
            PRINT 203, J,XR,FR
            RETURN
         END IF
C
C  SELECT THE THREE POINTS FOR THE NEXT ITERATION. WHEN DELX .GT. 0, CHOOSE
C  Y(2), Y(3), & XR, BUT WHEN DELX .LT. 0 CHOOSE Y(1), Y(2), & XR.
C  ENTER THE PROPER SET INTO Y ARRAY SO THEY ARE IN ASCENDING ORDER.
C
         IF ( DELX .GE. 0 ) THEN
            Y(1) = Y(2)
            F1 = F2
            IF ( DELX .GT. H2 ) THEN
               Y(2) = Y(3)
               F2 = F3
               Y(3) = XR
               F3 = FR
            ELSE
               Y(2) = XR
               F2 = FR
            END IF
         ELSE
            Y(3) = Y(2)
            F3 = F2
            IF ( ABS(DELX) .GT. H1 ) THEN
               Y(2) = Y(1)
               F2 = F1
               Y(1) = XR
               F1 = FR
            ELSE
               Y(2) = XR
               F2 = FR
            END IF
         END IF
   20 CONTINUE
C
C  WHEN LOOP IS NORMALLY TERMINATED, NLIM IS EXCEEDED
C
      I = -1
      PRINT 200, NLIM,XR,FR
      RETURN
C
  199 FORMAT(' AT ITERATION',I3,3X,' X =',E12.5,4X,'F(X) = ',E12.5)
  200 FORMAT(/' TOLERANCE NOT MET AFTER ',I4,' ITERATIONS X = ',
     +       E12.5,' F(X) = ',E12.5)
  201 FORMAT(/' FUNCTION HAS SAME SIGN AT X1 & X2 ')
  202 FORMAT(/' TOLERANCE MET IN ',I2,' ITERATIONS X = ',E10.5,
     +       ' F(X) = ',E10.5)
  203 FORMAT(/' F TOLERANCE MET IN ',I4,' ITERATIONS X = ',E10.5,
     +       ' F(X) = ',E10.5)
      END

Figure 1.22 Program 7.

                    OUTPUT FOR PROGRAM 7

 AT ITERATION  1    X =  .35995E+00    F(X) =  -.11837E-02
 AT ITERATION  2    X =  .36042E+00    F(X) =   .15922E-05

 F TOLERANCE MET IN    2 ITERATIONS X = .36042E+00 F(X) = .15922E-05
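
The quadratic fit and the point-selection logic of subroutine MULLR can be sketched compactly in Python. This is an illustration under assumptions: the names are chosen here, the three points and their function values are kept in two lists in ascending order of x, and, like the FORTRAN routine, the sketch handles only real roots (`math.sqrt` raises an error if the discriminant is negative).

```python
import math

def f(x):
    return 3.0 * x + math.sin(x) - math.exp(x)

def muller(func, xr, h, xtol, ftol, nlim=50):
    """Return (root estimate, iterations) by Muller's method for a real root."""
    y = [xr - h, xr, xr + h]
    fy = [func(v) for v in y]
    for iteration in range(1, nlim + 1):
        h1 = y[1] - y[0]
        h2 = y[2] - y[1]
        g = h1 / h2
        # coefficients of a*v^2 + b*v + c, with v measured from the middle point
        a = (fy[2] * g - fy[1] * (1.0 + g) + fy[0]) / (h1 * (h1 + h2))
        b = (fy[2] - fy[1] - a * h2 * h2) / h2
        c = fy[1]
        disc = math.sqrt(b * b - 4.0 * a * c)
        if b < 0.0:
            disc = -disc               # pick the sign that enlarges the denominator
        delx = -2.0 * c / (b + disc)
        xr = y[1] + delx
        fr = func(xr)
        if abs(delx) <= xtol or abs(fr) <= ftol:
            return xr, iteration
        # keep the three points nearest the new estimate, in ascending order
        if delx >= 0.0:
            if delx > h2:
                y, fy = [y[1], y[2], xr], [fy[1], fy[2], fr]
            else:
                y, fy = [y[1], xr, y[2]], [fy[1], fr, fy[2]]
        else:
            if abs(delx) > h1:
                y, fy = [xr, y[0], y[1]], [fr, fy[0], fy[1]]
            else:
                y, fy = [y[0], xr, y[1]], [fy[0], fr, fy[1]]
    raise RuntimeError("iteration limit reached before a tolerance was met")

root, iterations = muller(f, 0.5, 0.2, 0.0001, 0.00001)
print(f"x = {root:.5f} after {iterations} iterations, f(x) = {f(root):.3e}")
```

Writing the root of the quadratic as −2c/(b ± disc), with the sign chosen to match b, avoids subtracting nearly equal quantities when c is small, which is the situation close to a root.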


EXERCISES 


It is impossible to learn numerical analysis without lots of practice. We provide you with 
good opportunities to do this because this book is richer than most textbooks in examples 
for you to work on. There are two types of these. We call a drill problem (one that makes 
you repeat the kind of calculations that are used in our examples) an “exercise.” These 
test your ability to solve simple problems without offering much in the way of difficulty. 
Some of these have answers given at the back of the book (these exercises are marked). 

However, the real world is full of problems that are not so simple. We provide a 
second kind of problem, in a section labeled “Applied Problems and Projects,” that should 
be more challenging. While we make no claim that truly realistic problems will be found 
there, many are taken from various fields of science and engineering. You should find 
these more exciting, especially because some have no “answer” in the normal sense. 
These are provided to make you think. Some are extensions of things that we only touch
on in the text.

Some remarks on computation techniques are in order. We think you will get a better 
feel for the various algorithms if you do some or all of the exercises by hand, using a 
calculator and writing down the intermediate results. We also think you should write 
some programs of your own to solve the exercises; doing so will assure you that you 
really understand the algorithm. A third method is to use a prewritten program or sub- 
routine. In addition to those given (and copies of these are available to your instructor 
on disks in both FORTRAN and Pascal), you should gain some experience in using 
routines from one or more of the standard libraries such as IMSL. Which of these methods 
you use probably depends on your interests and what your instructor feels is most impor- 
tant. Hopefully you will do more than use this book as a cookbook of numerical recipes. 
EXERCISES*


Section 1.2


1. The equation $e^x - 3x = 0$ has a root at r = 0.61906129. Beginning with the interval [0, 1], use
six iterations of the method of halving the interval to find this root. How many iterations
would you need to evaluate the root correct to four significant places, that is, $|x - r| < 0.00005$?
How many for eight places?

2. The quadratic $(x - 0.4)(x - 0.6) = x^2 - x + 0.24$ has zeros at x = 0.4 and x = 0.6, of
course. Observe that the endpoints of the interval [0, 1] are not satisfactory to begin the
interval-halving method. Graph the function, and from this deduce the boundaries of intervals
that will converge to each of the zeros. If the endpoints of the interval [0.5, 1.0] are used
to begin the search, what is a bound to the error after five iterations? What is the actual error
after five repetitions of interval halving?


3. Interval halving applies to any continuous function, not just to polynomials. Find where the
graphs of $y = x - 2$ and $y = \ln x$ intersect by finding the root of $\ln x - x + 2 = 0$ correct
to four decimals.

4. Use interval halving to find the smallest positive root of these equations. In each case first
determine a suitable interval, then compute the root with relative accuracy of 0.5%.


Pb) ax? = ae T= 
d) $3x^3 + 4x^2 - 8x - 1 = 0$


Section 1.3

5. The polynomial $x^3 + x^2 - 3x - 3 = 0$, used as an example in Sections 1.2 and 1.3 where
the root at $x = \sqrt{3}$ was approximated, has its other roots at $x = -1$ and $x = -\sqrt{3}$. Beginning
with two suitable values that bracket the value $-\sqrt{3}$, show that the method of linear
interpolation converges to that root.


6. In Exercise 5, if one tried as starting values x = -1.5 and x = -1.7, the function would
not change sign, and, hence, they do not qualify for beginning the method of linear
interpolation. However, the secant method can begin with these values. Use them to begin the
secant method. How many iterations are needed to estimate the root correct to four decimals?
Suppose the starting values are -1.5 and -1.1; which root is obtained by the secant method?
What root if we begin with -1.5 and -1.25?

▶7. Find where the cubic $y = x^3 - x + 1$ intersects the parabola $y = 2x^2$. Make a sketch of the
two curves to locate the intersections, and then use linear interpolation and/or the secant
method to evaluate the x-values of the points of intersection.

8. a) Use the method of linear interpolation to solve the equations in Exercise 4.
b) Use modified linear interpolation to solve these equations and compare the rates of
convergence with those obtained in part (a).


9. Write the algorithm for the secant method, following the model of the other algorithms of 
this chapter. 


Section 1.4

▶10. Find a root near x = -0.5 of $e^x - 3x^2 = 0$ by Newton's method, to six-digit accuracy.

*Answers are given at the end of the text for exercises marked by ▶.

11. The equation $e^x - 3x^2 = 0$ has a root not only near x = -0.5, but also near x = 4.0. Find
the positive root by Newton's method.

12. Use Newton's method to solve the equations in Exercise 4. How many iterations are required
to attain the specified accuracy?

13. a) Use Newton's method on the equation $x^2 = N$ to derive the algorithm for the square root
of N:

$$x_{n+1} = \frac{1}{2}\left(x_n + \frac{N}{x_n}\right),$$

where $x_0$ is an initial approximation to $\sqrt{N}$.
b) Derive similar formulas for the third and fourth roots of N.

14. a) If the algorithm of Exercise 13 is applied twice, show that

$$\sqrt{N} \approx \frac{A + B}{4} + \frac{N}{A + B}, \quad \text{where } N = AB.$$

b) Show also that the relative error (error/true value) in (a) is approximately

$$\frac{1}{8}\left(\frac{A - B}{A + B}\right)^4.$$

15. Expand f(x) about the point x = a in a Taylor series. (See Appendix A if you have forgotten 
this.) Using appropriate terms from this, derive the formula for Newton's method. 

16. $(x - 1)^3(x - 2) = x^4 - 5x^3 + 9x^2 - 7x + 2 = 0$ obviously has a root at x = 2, and a
triple root at x = 1. Beginning with x = 2.1, use Newton's method once, and observe the
degree of improvement. Then start with x = 0.9, and note the much slower convergence to
the triple root even though the initial error is only 0.1 in each case. Use the secant method
beginning with f(0.9) and f(1.1), and observe that just one application brings one quite close
to the root in contrast to Newton's method. Explain.

Section 1.5 

17. Use Muller’s method to solve the problems below. Apply four iterations and determine how 
the successive errors are related to each other. Use starting values that differ by 0.2 from 
each other. 

a) $2x^3 + 4x^2 - 2x - 5 = 0$, root near 1.0 (exact value is 1.07816259);
b) $e^x - 3x^2 = 0$, root near 4.0 (exact value is 3.73307903);
c) $e^x - 3x^2 = 0$, root near 1.0;
d) $\tan x - x - 1 = 0$, root near 1.1.

18. Muller's method is sometimes used in a "self-starting" form. Instead of specifying starting
values that are near a root, the algorithm arbitrarily begins with $x_2 = 0$, $x_1 = 0.5$, $x_0 = -0.5$.
Use these starting values on the equations of Exercise 17. One is supposed to obtain
the root nearest the origin by this technique. Will this always be true?

19. An extension of the self-starting principle in Exercise 18 is to deflate the function after a
first root is found. (Deflating means forming a new function that has the same roots as the
original function except the root x = r, by dividing f(x) by (x - r).) Test this by comparing
the graphs of $f(x) = x(x - 3)(x - 1)$, $g(x) = x(x - 3)$, $h(x) = x(x - 1)$, and
$j(x) = (x - 1)(x - 3)$. Use this deflating technique to obtain all the roots of the equations in
Exercise 17, beginning as in Exercise 18.

20. Since Muller's method subtracts function values that may be very nearly the same, there is
a possibility of large relative errors. Investigate this by using starting values that are very
close to each other (but not near a root) in some of the equations in Exercise 17. The difficulty
should be exaggerated when f(x) changes very slowly near a root. See if this exaggerated
effect is exhibited by a polynomial function that has a triple root.

▶21. Muller's method works on complex roots if complex arithmetic is used. Find a complex root
of $x^4 + 4x^3 + 21x^2 + 4x + 20 = 0$. (Newton's method can also find complex roots; you
may wish to make the comparison with Newton's method also. Complex arithmetic is laborious
by hand; you will probably want to employ computer programs in this exercise.)


Section 1.6 


22. $f(x) = e^x - 3x^2 = 0$ has three roots. An obvious rearrangement is

$$x = \pm\sqrt{e^x/3}.$$

Show, beginning with $x_0 = 0$, that this will converge to a root near -0.5 if the negative
value is used, and that it converges to a root near 1.0 if the positive value is used. Show,
however, that this form does not converge to the third root near 4.0 even when a nearly exact
starting value is used. Find another form that will converge to the root near 4.0.

23. One root of the quadratic $x^2 + x - 1 = 0 = x(x + 1) - 1$ is at x = 0.6180. The equivalent
form $x = 1/(x + 1)$ converges to this root beginning at $x_0 = 1$. Carrying four or five decimals,
how many steps are required to reach the root (correct to four decimals) by linear iteration?
If Aitken acceleration is used after three approximations are available, how many iterations
are required?

24. The form $x = 1/(x + 1)$ of Exercise 23 will converge to a root of the quadratic for many
starting values in addition to $x_0 = 1$. For what starting values will it not converge to a root?
(For this problem do not stop with a division by zero; that is, for $x_0 = -1$, $x_1 = 1/0$. Use
$x_2 = \lim_{x \to \infty} 1/(x + 1) = 0$.)

25. The cubic $2x^3 + 4x^2 - 2x - 5 = 0$ has a root near x = 1. Find at least three rearrangements
that will converge to this root beginning with $x_0 = 1.0$.


26. Show that for the points $x_0, x_1, x_2, x_3$ in Exercises 22 and 23 the value
aoa = 33 Dan 
| api | 7 a <> ~\2\ 
((S4-USa) (4a - Kaa) 
is close to 1 in absolute value, where the sums are from i = 0 to i = 2.
Section 1.7 
27. a) Show that if f(x) has a double root at x = r, then $f'(r) = 0$.
b) Show that if f(x) has a root of multiplicity m at x = r, then

$$f^{(k)}(r) = 0, \quad k = 1, 2, \ldots, m - 1.$$

28. If f(x) = 0 has a double root at x = r, so that $f'(r) = 0$, then the condition

$$\left|\frac{f(x)\,f''(x)}{[f'(x)]^2}\right| < 1$$

may not hold for any interval including r. According to Section 1.7, we therefore have,
for Newton's method, no assurance of convergence to the root at r. The simple equation
$(x - 2)^2 = 0 = x^2 - 4x + 4$ has such a double root at x = 2, and $f'(x) = 2x - 4$ is
zero at x = 2. Still, beginning at any finite value of $x_0$, Newton's method will converge!
Reconcile this fact with the convergence criterion of Section 1.7.


▶29. For the quadratic $x^2 - 4x + 4 = 0$, which has a double root at x = 2, begin with $x_0 = 1$
and compute successive approximations to the root by Newton's method. Tabulate the errors
at each step and compare each with the next. Is Newton's method quadratically or only linearly
convergent in this case? How could one accelerate the convergence?

30. If $f(x)$, $f'(x)$, $f''(x)$ are continuous and bounded on a certain interval containing x = r, and
if both $f(r) = 0$ and $f'(r) = 0$, but $f''(r) \ne 0$, show that the form

$$x_{n+1} = x_n - 2\,\frac{f(x_n)}{f'(x_n)}$$

will converge quadratically if $x_n$ is in the interval. (Hint: The algorithm is of the form
$x_{n+1} = g(x_n)$. Show that $g'(r) = 0$, using L'Hôpital's rule.)


31. The method suggested by Exercise 30 extends to a root of multiplicity m:

$$x_{n+1} = x_n - m\,\frac{f(x_n)}{f'(x_n)}.$$

Show, with suitable restrictions on f(x), that this is quadratically convergent.


Section 1.8

▶32. Two of the four roots of the quartic $x^4 + 2x^3 - 7x^2 + 3 = 0$ are positive. Find these by
Newton's method correct to seven decimal places using a calculator, and then determine the
other roots from the reduced equation. Use synthetic division.


33. a) If p(x) has a root of multiplicity m at x = r, show that the function $q(x) = p(x)/p'(x)$
has the same roots as p(x) but just a simple root at x = r.
b) Show that in fact all the roots of q(x) are simple.

▶34. Let $P(x) = x^4 + 4.6x^3 + 6.6x^2 - 11x - 14$. Use synthetic division to evaluate P(-1).
Write P(x) as a product of two polynomials, one of degree 3. Find the one positive real root
of this cubic polynomial and then use the quadratic formula to find the last two roots. This
process is called deflation.

35. Let $P(x) = a_1x^n + a_2x^{n-1} + \cdots + a_nx + a_{n+1}$. Write a computer program (any language)
that evaluates P(x) and P'(x) by synthetic division.


Section 1.9 


▶36. Beginning with the trial factor $x^2 - 4x + 5$, improve by successive applications of Bairstow's
method to find the quadratic factors of


ee BEB BxF ee tS 25 


What are the four zeros of this polynomial? 


37. Solve Exercise 32 by Bairstow's method to get quadratic factors.

38. None of the roots of this fourth-degree polynomial is real. Find these complex roots by
resolving into quadratic factors:

$$x^4 + 4x^3 + 21x^2 + 4x + 20 = 0.$$


39. When the modulus of one pair of complex roots is the same as for another pair of complex
roots, the Bairstow method is slow to converge. Try your patience on

$$x^4 - x^3 + x^2 - x + 1 = 0,$$

which has as factors $(x^2 + 0.618034x + 1)(x^2 - 1.618034x + 1)$. Start with $x^2 + 0.6x + 1$.
What is the modulus of the roots?

40. Use the QD algorithm to approximate the roots of


bay xP 2? = Beh 1 = 0, b) 3x3 + 3x? —- 8x -1=0 
c) $x^4 - 5x^3 + 9x^2 - 7x + 2 = 0$, ▶d) $x^4 - 3.1x^3 + 2.1x^2 + 1.1x + 5.2 = 0$


41. The QD algorithm is also not very efficient on Exercise 39. Solve that problem by QD. 


Section 1.11

42. In Section 1.11, an example of a truncated Taylor series for $e^x$ is given. Use that cubic
equation to approximate $e^x$ for values of x from 0 to 0.5, using at least eight-decimal precision.
Make a graph of the errors versus x. Also graph the upper bounds on the errors versus x.
(See Appendix A if you need to refresh your memory of the error term for a Taylor series.)


43. Repeat Exercise 42 but now truncate at each step in the computations after five decimals. 
Why are the error curves different? 

44. In Exercise 34 you solved for the zeros of a fourth-degree polynomial. Suppose the coefficients
as given there are inexact. What influence does this have on the values of the roots? Answer
this question by determining the effect on each of the roots due to a 1% change in each of
the five coefficients; study the effect of the change in each coefficient separately from the
changes in the others. Which coefficient makes the greatest difference in the roots when varied
by 1%?

45. Exercise 32 has a quartic polynomial for which you found all four roots. Repeat that exercise,
but now terminate your computations of the first two roots after attaining three decimal places
of accuracy. This means that the reduced equations are different. What is the effect on the
succeeding roots?


Section 1.12 


46. a) For a simplified normalized floating-point system, $B = 10$, $p = 3$, $-9 \le e \le +9$. How
many distinct numbers are there?
b) How many if the numbers are not normalized?
c) What if p = 4?
d) What if $-3 \le e \le +3$?

In Exercises 47 through 50, use the floating-point system of Exercise 46(a).

47. Find the absolute error of each result, both with "chopping" and with rounding to three digits.
a) $3.26 \times 10^{-3} + 2.07 \times 10^{-4}$
>b) 1.96 x 10° — 1.94 x 104 
c) $(3.26 \times 10^{-3} + 2.07 \times 10^{-4}) - 2.01 \times 10^{-4}$
d) $3.26 \times 10^{-3} + (2.07 \times 10^{-4} - 2.01 \times 10^{-4})$
48. Find the relative errors, both with “chopping” and with rounding to three digits. 
a) 3.28 x 10°? * 6.98 x 108 
b) 3.28 x 1078 * 6.98 x 10-7 
c) (3.28 X 10>? * 6.98 x 103) + 4.82 x 1078 
d) 3.28 x 107? * (6.98 x 10? + 4.82 x 1078) 
49. Find the absolute and relative errors with “chopping” to three digits. 
a) 4.82 x 10? + 8.81 x 108 
b) 1.06 x 10-% + 4.06 x 10? 
mc) 4.82 x 10? + (8.81 x 108 * 4.06 x 10-7) 
d) (4.82 x 10? + 8.81 x 108) + 4.06 x 10>? 
50. Find examples where:

a) $(a + b) + c \ne a + (b + c)$
▶b) $(a * b) * c \ne a * (b * c)$
c) $a * (b + c) \ne a * b + a * c$

51. Evaluate the polynomial $2.75x^3 - 2.95x^2 + 3.16x - 4.67$ for x = 1.07, using both chopping
after three digits and rounding to three digits. What are the absolute and relative errors?
Does nested multiplication differ from evaluation in "standard form"? What is the error of
$2.75x^3 - (2.95x^2 - 3.16x) - 4.67$?

52. Determine how floating-point numbers are stored in the computer systems available to you. 
How many decimal digits of accuracy for single-precision numbers? for double-precision? 
What are the execution times for each of the arithmetic operations for single-precision? for 
double-precision? 

▶53. Write a computer program to determine the effect of the sums in (a), (b), and (c):

a) 0.01 added 100 times;

b) 0.001 added 1000 times;

c) 0.0001 added 10,000 times.

Print out values for sums that should equal 0.1, 0.2, ..., 1.0.

d) Evaluate the divergent infinite series, $1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots$, with a computer program.
Why is it not divergent in a computer?

e) Compare the sum in part (d) when evaluated from right to left with the sum evaluated
from left to right. Explain the differences.
54. Given


x" +x+2y' ty =f(t), 


x"-x+y= g(t), x(0) = x'(0) = y(0) = 0. 


In solving this pair of simultaneous second-order differential equations by the Laplace trans- 
form method, it becomes necessary to factor the expression 


(S? + 1S) — 25 + 108 - |) = -8K - F + 38 +1, 
so that partial fractions can be used in getting the inverse transform. What are the factors? 
55. DeSantis (1976) has derived a relationship for the compressibility factor of real gases of the 


form 


BAS Gr oe ests ioe 


a ay 
where y = b/(4v), b being the van der Waals correction and v the molar volume. If z = 
0.892, what is the value of y? 

56. In studies of solar-energy collection by focusing a field of plane mirrors on a central collector, 
Vant-Hull (1976) derives an equation for the geometrical concentration factor C: 


$$C = \frac{\pi (h/\cos A)^2 F}{0.5\pi D^2 (1 + \sin A - 0.5\cos A)},$$

where A is the rim angle of the field, F is the fractional coverage of the field with mirrors,
D is the diameter of the collector, and h is the height of the collector. Find A if h = 300,
C = 1200, F = 0.8, and D = 14.


57. Lee and Duffy (1976) relate the friction factor for flow of a suspension of fibrous particles
to the Reynolds number by this empirical equation:

$$\frac{1}{\sqrt{f}} = \frac{1}{k}\ln\left(RE\sqrt{f}\right) + \left(14 - \frac{5.6}{k}\right).$$

In their relation, f is the friction factor, RE is the Reynolds number, and k is a constant
determined by the concentration of the suspension. For a suspension with 0.08% concentration,
k = 0.28. What is the value of f if RE = 3750?

58. The Redlich-Kwong equation is

$$P = \frac{RT}{v - b} - \frac{A(T)}{v(v + b)}.$$

Measurements of P = 87.3, T = 486.9, and v = 12.005 have been made. It is known that
A(T) = 0.0837 under these conditions. R = 1.98, a constant. Find the value of b that satisfies
the equation.

59. Based on the work of Frank-Kamenetski in 1955, temperatures in the interior of a material
with embedded heat sources can be determined if we solve this equation: 


eV eggh-NeW/2") = Wik 
Given that L,, = 0,088, find 1 


60. Suppose we have the 555 Timer Circuit

[Figure: 555 timer circuit schematic]

whose output waveform is

[Figure: output waveform]

where

$$T_1 + T_2 = \frac{1}{f}, \qquad f = \text{frequency},$$

$$\text{Duty cycle} = \frac{T_1}{T_1 + T_2} \times 100\%.$$
It can be shown that 
T, = R,Cln(2) 
= pate « (a — 7) 
cam! Ae 2R, — Resi)” 


Given that R, = 8670, C = 0.01 x 10-*, T, = 1.4 x 107%, find 


a) T,, f, and the duty cycle, 
b) Rg using SUBROUTINE PLNINT. 
c) Select an f and duty cycle, find 7, and T>. 


61. The solution of boundary-value problems by an analytical (Fourier series) method often
involves finding the roots of transcendental equations to evaluate the coefficients. For example,

$$y'' + \lambda y = 0, \quad y(0) = 0, \quad y(1) = y'(1),$$

involves solving tan z = z. Find three values of z other than z = 0.
62. Find all max/min points of the function
F(x) = (sin(x))® * e7° * tan(1 — x) 
on the interval [0, 1]. Compare your own root-finding program with the IMSL subroutine
ZBRENT. (Note the disadvantage in trying to solve f'(x) = 0 using Newton's method.)
63. In Chapter 4, a particularly efficient method for numerical integration of a function, called
Gaussian quadrature, is discussed. In the development of formulas for this method it is
necessary to evaluate the zeros of Legendre polynomials. Find the zeros of the Legendre
polynomial of sixth order:

$$P_6(x) = \tfrac{1}{48}\left(693x^6 - 945x^4 + 315x^2 - 15\right).$$

(Note: All the zeros of the Legendre polynomials are less than one in magnitude and, for
polynomials of even order, are symmetrical about the origin.)

64. The Legendre polynomials of Problem 63 are one set of a class of polynomials known as
orthogonal polynomials. Another set are the Laguerre polynomials. Find the zeros of the
following:
a) $L_3(x) = x^3 - 9x^2 + 18x - 6$
b) $L_4(x) = x^4 - 16x^3 + 72x^2 - 96x + 24$

65. Still another set of orthogonal polynomials are the Chebyshev polynomials. (We will use these
in Chapter 10.) Find the roots of

$$T_6(x) = 32x^6 - 48x^4 + 18x^2 - 1 = 0.$$

(Note the symmetry of this function. All the roots of Chebyshev polynomials are also less
than one in magnitude.)

66. A sphere of density d and radius r weighs $\frac{4}{3}\pi r^3 d$. The volume of a spherical segment is
$\frac{1}{3}\pi(3rh^2 - h^3)$. Find the depth to which a sphere of density 0.6 sinks in water as a fraction
of its radius. (See Fig. 1.23.)
67. Write a subroutine that performs the QD algorithm of Section 1.10. Include an algorithm in
the style of this chapter and a flowchart. Your routine should be able to handle both real and
complex roots. Test it with several well-chosen examples.

68. a) Rewrite the subroutine NEWTN to handle complex roots and test it.

b) Do the same for subroutine MULLR.

69. Rewrite program BRSTOW so it will handle polynomials with complex coefficients. Test it

with 
x5 — (1.5 — 2.2i)x = 0. 


70. Write a program that uses linear interpolation to get a first approximation to a root of
f(x) = 0 and then calls subroutine NEWTN to refine it.




Solving Sets of Equations 


2.0 CONTENTS OF THIS CHAPTER 


Chapter 2 covers the most important topic of how to solve large systems of linear 
equations with efficiency and accuracy. A linear system is perhaps the most 
widely applied mathematical technique when real-world situations are simulated. 


2.1 VOLTAGES AND CURRENTS IN A NETWORK 
Is a typical application of simulating a real-world problem by a system of equations 

2.2 MATRIX NOTATION
Describes how matrices can represent a system of equations in a compact form 
that facilitates their manipulation 

2.3 ELIMINATION METHOD
Is a review of the classical procedure for solving equations simultaneously 

2.4 GAUSS AND GAUSS-JORDAN METHODS 
Are adaptations of the familiar elimination method that give improved accuracy 
and efficiency, and lead to using a computer to do the computations 

2.5 OTHER LU METHODS 
Shows that the computations used to reduce a system of equations can be saved 
within the coefficient matrix, allowing one to resolve the system of equations with 
new right-hand-side values with a minimum of work 

2.6 PATHOLOGY IN LINEAR SYSTEMS—SINGULAR MATRICES 
Points out that there are times when an accurate solution to a system of equations 
is extremely difficult to obtain and gives techniques to minimize the problem; also 
introduces you to some important properties of matrices 
2.7 DETERMINANTS AND MATRIX INVERSION
Explains how the methods of solving systems apply to getting the determinant and 
the inverse of a matrix 
2.8 NORMS 
Are measures of the magnitude of matrices and vectors 
2.9 CONDITION NUMBERS AND ERRORS IN SOLUTIONS 
Explores how we can know when a solution is subject to error and what we can 
do to improve the accuracy of the solution 
2.10 ITERATIVE METHODS 
Discusses another approach to solving linear systems that is preferred in certain 
applications 
2.11 RELAXATION METHOD 
Is historically important and provides background for a technique for accelerating 
the convergence of the iterative technique 
2.12 SYSTEMS OF NONLINEAR EQUATIONS
Are very much harder to solve than linear systems but sometimes must be used 
to have a good mathematical model of the problem; methods are described to do 
this tough job 
2.13 CHAPTER SUMMARY 
Tells you what you should understand after studying the chapter 
2.14 COMPUTER PROGRAMS 
Gives programs and subroutines that implement the methods of the chapter 


2.1 VOLTAGES AND CURRENTS IN A NETWORK


Electrical engineers often must find the currents flowing and voltages existing in a complex
resistor network. Here is a typical problem.

Seven resistors are connected as shown, and voltage is applied to the circuit at points
1 and 6 (see Figure 2.1). You may recognize the network as a variation on a Wheatstone
bridge.


[Figure 2.1: the resistor network, with nodes numbered 1 through 6]

Suppose we want to know the current that would flow between points 3 and 4. One
method would be to construct the network, apply the voltage, and measure the current 
that flows with an ammeter. A preferred technique is to compute the current by applying 
the laws of physics. 

While we are especially interested only in finding the current that flows through the 
ammeter, the computational method can give the voltages at each numbered point (these 
are called nodes) and the current through each of the branches of the circuit. Two laws 
are involved: 


Kirchhoff's law: the sum of all currents flowing into a node is zero.
Ohm's law: the current through a resistor equals the voltage across it divided by
its resistance.


We can set up eleven equations using these laws and from these solve for eleven 
unknown quantities (the four voltages and seven currents): 


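Currents flowing into the four nodes:

$$\begin{aligned}
i_{12} - i_{23} - i_{24} &= 0 \quad (\text{node 2})\\
i_{23} - i_{34} - i_{35} &= 0 \quad (\text{node 3})\\
i_{24} + i_{34} - i_{45} &= 0 \quad (\text{node 4})\\
i_{35} + i_{45} - i_{56} &= 0 \quad (\text{node 5})
\end{aligned}$$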

Currents through the resistors: 


5 - V5 


These equations provide 11 linear equations in 11 unknown variables. This chapter 
describes methods for solving for these variables in ways that are adapted to computers. 
Without a computer to do the work, the job is tedious and it is easy to make computational 
errors. We will usually rearrange the equations into a standard form, however. Here is 
how they should be rearranged. 


$$\begin{aligned}
i_{12} - i_{23} - i_{24} &= 0\\
i_{23} - i_{34} - i_{35} &= 0\\
i_{24} + i_{34} - i_{45} &= 0\\
i_{35} + i_{45} - i_{56} &= 0
\end{aligned}$$

14i9 + V3 =5 


6124 — Vz + V4 =0 
O.1ig4 = V5 + V4 = 0. 
Tiys - V3 +Vs =0 


2.2 MATRIX NOTATION


Our discussion of methods to solve sets of linear equations will be facilitated by some
of the concepts and notation of matrix algebra. Only the more elementary ideas will be
needed.

A matrix is a rectangular array of numbers in which not only the value of the number
is important but also its position in the array. The size of the matrix is described by the
number of its rows and columns. A matrix of n rows and m columns is said to be n × m.
The elements of the matrix are generally enclosed in brackets, and double-subscripting
is the common way of indexing the elements. The first subscript always denotes the row
and the second denotes the column in which the element occurs. Capital letters are used
to refer to matrices. For example,

$$A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1m}\\
a_{21} & a_{22} & \cdots & a_{2m}\\
\vdots & \vdots & & \vdots\\
a_{n1} & a_{n2} & \cdots & a_{nm}
\end{bmatrix} = [a_{ij}], \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, m.$$

Enclosing the general element $a_{ij}$ in brackets is another way of representing matrix A, as
shown above.


Two matrices of the same size may be added or subtracted. The sum of

$$A = [a_{ij}] \quad \text{and} \quad B = [b_{ij}]$$

is the matrix whose elements are the sum of the corresponding elements of A and B,

$$C = A + B = [a_{ij} + b_{ij}] = [c_{ij}].$$

Similarly we get the difference of two equal-sized matrices by subtracting corresponding
elements. If two matrices are not equal in size, they cannot be added or subtracted. Two 
matrices are equal if and only if each element of one is the same as the corresponding 
element of the other. Obviously, equal matrices must be of the same size. Some examples 
will help make this clear. 


If 


Bleek aK _ Te <3" 2 
a=[_j 0 | ae a-|f =) “ih 


we say that A is 2 X 3 because it has two rows and three columns. B is also 2 X 3. 
Their sum C is also 2 x 3: 


a8 NS 
cxata-2 $3 


The difference D of A and B is 
Se es ee 
Simm nels aa 


Multiplication of two matrices is defined as follows, when A is n × m and B is m × r:

$$[a_{ij}][b_{ij}] = [c_{ij}] = \begin{bmatrix}
(a_{11}b_{11} + a_{12}b_{21} + \cdots + a_{1m}b_{m1}) & \cdots & (a_{11}b_{1r} + \cdots + a_{1m}b_{mr})\\
(a_{21}b_{11} + a_{22}b_{21} + \cdots + a_{2m}b_{m1}) & \cdots & (a_{21}b_{1r} + \cdots + a_{2m}b_{mr})\\
\vdots & & \vdots\\
(a_{n1}b_{11} + a_{n2}b_{21} + \cdots + a_{nm}b_{m1}) & \cdots & (a_{n1}b_{1r} + \cdots + a_{nm}b_{mr})
\end{bmatrix},$$

$$c_{ij} = \sum_{k=1}^{m} a_{ik}b_{kj}, \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, r.$$

It is simplest to select the proper elements if one counts across the rows of A with the
left hand while counting down the columns of B with the right. Unless the number of
columns of A equals the number of rows of B (so the counting comes out even), the
matrices cannot be multiplied. Hence if A is n × m, B must have m rows or else they
are said to be "nonconformable for multiplication" and their product is undefined. In
general AB ≠ BA, so the order of factors must be preserved in matrix multiplication.

If a matrix is multiplied by a scalar (a pure number), the product is a matrix, each
element of which is the scalar times the original element. We can write

$$\text{If } kA = C, \quad c_{ij} = ka_{ij}.$$

A matrix with only one column, n × 1 in size, is termed a column vector, and one
of only one row, 1 × m in size, is called a row vector. When the unqualified term vector
is used, it nearly always means a column vector. Frequently the elements of vectors are
only singly subscripted.

Some examples of matrix multiplication are 


240 = 2 4 yy 
Suppose A = [ * | B=| 0 1], x= ]-2), y=Iyo 
=i 2 3 
4 -1 1 3 
“5 -4 0 6 
A*B= B 3) BeA=|-1 2 Bh 
~ 9 4 3 


ell OH, ya | 2y + 4y2 
ne [ } Ay = hes + 2y + 3y3 


Since A is 2 × 3 and B is 3 × 2, they are conformable for multiplication and their product
is 2 × 2. When we form the product B * A, it is 3 × 3. Observe that not only is
AB ≠ BA; AB and BA are not even the same size. The product of A and the vector x
(a 3 × 1 matrix) is another vector, one with two components. Similarly, Ay has two
components. We cannot multiply B times x or B times y; they are nonconformable.

The product of the scalar number 2 and A is

$$2A = \begin{bmatrix} 4 & 8 & 0\\ -2 & 4 & 6 \end{bmatrix}.$$


Since a vector is just a special case of a matrix, a column vector can be multiplied
by a matrix, so long as they are conformable in that the number of columns of the matrix
equals the number of elements (rows) in the vector. The product in this case will be
another column vector. The size of a product of two matrices, the first m × n and the
second n × r, is m × r. An m × n matrix times an n × 1 vector gives an m × 1 product.

The general relation for Ax = b is

$$\underset{(m \times n)}{A}\;\underset{(n \times 1)}{x} = \underset{(m \times 1)}{b},$$

where each size is written as (No. of rows) × (No. of cols.).
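The row-by-column rule and the conformability test can be sketched in Python as follows. This is a minimal illustration, not one of the programs of this book: the helper name `mat_mul` is our own, and the values chosen for A, B, and x are assumptions made for the demonstration (A is the 2 × 3 matrix whose double, 2A, appears above).

```python
def mat_mul(A, B):
    """Multiply an n x m matrix A by an m x r matrix B (lists of rows).

    Raises ValueError when the number of columns of A differs from the
    number of rows of B, i.e. when the matrices are nonconformable.
    """
    n, m = len(A), len(A[0])
    if len(B) != m:
        raise ValueError("nonconformable for multiplication")
    r = len(B[0])
    # c_ij is the sum over k of a_ik * b_kj: across row i of A, down column j of B.
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(r)]
            for i in range(n)]


A = [[2, 4, 0], [-1, 2, 3]]       # 2 x 3
B = [[-1, 2], [0, 1], [4, -1]]    # 3 x 2 (assumed values)
x = [[1], [-2], [3]]              # 3 x 1 column vector (assumed values)

AB = mat_mul(A, B)   # 2 x 2
BA = mat_mul(B, A)   # 3 x 3, so AB and BA differ even in size
Ax = mat_mul(A, x)   # 2 x 1 column vector
print(AB)
print(BA)
print(Ax)
```

Calling `mat_mul(B, x)` raises the error, because B has two columns while x has three rows.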
Two vectors, each with the same number of components, may be added or subtracted.
Two vectors are equal if each component of one equals the corresponding component of 
the other. 

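This definition of matrix multiplication permits us to write the set of linear equations

$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1,\\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2,\\
&\;\;\vdots\\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= b_n,
\end{aligned}$$

much more simply in matrix notation, as Ax = b, where

$$A = [a_{ij}], \quad x = \begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{bmatrix}, \quad b = \begin{bmatrix} b_1\\ b_2\\ \vdots\\ b_n \end{bmatrix}.$$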


For example. 


is the same as the set of equations 


3x, + 2x, + 4x, = 14, 
yeaa al 
ag FP Say xg = 2. 


A very important special case is the multiplication of two vectors. The first must be
a row vector if the second is a column vector, and each must have the same number of
components. For example,

$$\begin{bmatrix} 1 & 3 & -2 \end{bmatrix} * \begin{bmatrix} 4\\ -1\\ 3 \end{bmatrix} = [-5]$$

gives a "matrix" of one row and one column. The result is a pure number, a scalar. This
product is called the scalar product of the vectors, also called the inner product.
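As a quick check, the scalar product can be computed directly as a sum of products of corresponding components. The sketch below is only illustrative: the function name `inner` is our own, and the third component of the column vector is assumed to be 3, the value that gives the result -5 quoted in the example.

```python
def inner(u, v):
    """Scalar (inner) product of two vectors with the same number of components."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of components")
    return sum(a * b for a, b in zip(u, v))


row = [1, 3, -2]
col = [4, -1, 3]   # third component assumed to be 3

s = inner(row, col)   # 1*4 + 3*(-1) + (-2)*3
print(s)
```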
Certain square matrices have special properties. The diagonal elements are the line
of elements $a_{ii}$ from upper left to lower right of the matrix. If only the diagonal terms
are nonzero, the matrix is called a diagonal matrix. When the diagonal elements are each
equal to unity while all off-diagonal elements are zero, the matrix is said to be the identity
matrix of order n. The usual symbol for such a matrix is $I_n$, and it has properties similar
to unity. For example, the order-4 identity matrix is

$$\begin{bmatrix}
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 0 & 0 & 1
\end{bmatrix} = I_4.$$

The subscript is omitted when the order is clear from the context.


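A vector that has all its elements equal to zero except one element, which has a 
value of unity, is called a unit vector. There are three distinct unit vectors for order-3 
vectors; they are 

\[
\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \qquad
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \qquad \text{and} \qquad
\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}.
\]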


If all the elements above the diagonal are zero, a matrix is called lower-triangular; 
it is called upper-triangular when all the elements below the diagonal are zero. For 
example, these order-3 matrices are lower- and upper-triangular: 


1 0 0 | a 
L=| 4 6 O}|, U=/0 -1 0}. 
—2 1 -4 0 oO 1 


Tridiagonal matrices are those that have nonzero elements only on the diagonal and in 
the positions adjacent to the diagonal; they will be of special importance in certain partial- 
differential equations. An example of a tridiagonal matrix is 

\[
\begin{bmatrix}
-4 & 2 & 0 & 0 & 0 \\
1 & -4 & 1 & 0 & 0 \\
0 & 1 & -4 & 1 & 0 \\
0 & 0 & 1 & -4 & 1 \\
0 & 0 & 0 & 2 & -4
\end{bmatrix}.
\]


The transpose of a matrix is the matrix that results when the rows are written as 
columns (or, alternatively, when the columns are written as rows). The symbol $A^T$ is 
used for the transpose of A. 

\[
A = \begin{bmatrix} 3 & -1 & 4 \\ 0 & 2 & -3 \\ 1 & 1 & 2 \end{bmatrix}, \qquad
A^T = \begin{bmatrix} 3 & 0 & 1 \\ -1 & 2 & 1 \\ 4 & -3 & 2 \end{bmatrix}.
\]

It should be obvious that the transpose of $A^T$ is just A itself. 

When a matrix is square, a quantity called its trace is defined. The trace of a square 
matrix is the sum of the elements on its main diagonal. For example, the traces of the 
above matrices are 

\[
\mathrm{tr}(A) = 3 + 2 + 2 = 7; \qquad \mathrm{tr}(A^T) = 3 + 2 + 2 = 7.
\]

It should be obvious that the trace remains the same if a square matrix is transposed. 
We present here some additional examples of arithmetic operations with matrices. 


rae 1 2) Jes 
a} ‘l-f sl 


1,3, 2) »[=1- 0 2] jonseeey 
-104 4 1-3) Bae 


3 Uf -Pe) mm [fa] [2] anes 
E ald Egh we EE ee a]-F Sh 


Division of a matrix by another matrix is not defined, but we will discuss the inverse 
of a matrix later in this chapter. 

The determinant of a square matrix is a number. For a $2 \times 2$ matrix, the determinant 
is computed by subtracting the product of the elements on the minor diagonal (from upper 
right to lower left) from the product of terms on the major diagonal. For example, 

\[
A = \begin{bmatrix} 3 & -1 \\ 2 & 4 \end{bmatrix}, \qquad \det(A) = (3)(4) - (-1)(2) = 14;
\]

det (A) is the usual notation for the determinant of A. Sometimes the determinant is 
symbolized by writing the elements of the matrix between vertical lines (similar to 
representing the absolute value of a number). 

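For a $3 \times 3$ matrix, you may have learned a crisscross way of forming products of 
terms (we call it the “spaghetti rule”) that probably should be forgotten, for it applies 
only to the special case of a $3 \times 3$ matrix; it won't work for larger systems. The general 
rule that applies in all cases is to expand in terms of the minors of some row or column. 
The minor of any term is the matrix of lower order formed by striking out the row and 
column in which the term is found. The determinant is found by adding the product of 
each term in any row or column by the determinant of its minor, with signs alternating 
+ and -. We expand each of the determinants of the minor until we reach $2 \times 2$ matrices. 
For example, 

\[
\text{Given } A = \begin{bmatrix} 3 & 0 & -1 & 2 \\ 4 & 1 & 3 & -2 \\ 0 & 2 & -1 & 3 \\ 1 & 0 & 1 & 4 \end{bmatrix},
\]

\[
\det(A) = 3\begin{vmatrix} 1 & 3 & -2 \\ 2 & -1 & 3 \\ 0 & 1 & 4 \end{vmatrix}
- 0\begin{vmatrix} 4 & 3 & -2 \\ 0 & -1 & 3 \\ 1 & 1 & 4 \end{vmatrix}
+ (-1)\begin{vmatrix} 4 & 1 & -2 \\ 0 & 2 & 3 \\ 1 & 0 & 4 \end{vmatrix}
- 2\begin{vmatrix} 4 & 1 & 3 \\ 0 & 2 & -1 \\ 1 & 0 & 1 \end{vmatrix}
\]

\[
\begin{aligned}
&= 3\left\{ 1\begin{vmatrix} -1 & 3 \\ 1 & 4 \end{vmatrix} - 3\begin{vmatrix} 2 & 3 \\ 0 & 4 \end{vmatrix} + (-2)\begin{vmatrix} 2 & -1 \\ 0 & 1 \end{vmatrix} \right\}
+ (-1)\left\{ 4\begin{vmatrix} 2 & 3 \\ 0 & 4 \end{vmatrix} - 1\begin{vmatrix} 0 & 3 \\ 1 & 4 \end{vmatrix} + (-2)\begin{vmatrix} 0 & 2 \\ 1 & 0 \end{vmatrix} \right\} \\
&\qquad - 2\left\{ 4\begin{vmatrix} 2 & -1 \\ 0 & 1 \end{vmatrix} - 1\begin{vmatrix} 0 & -1 \\ 1 & 1 \end{vmatrix} + 3\begin{vmatrix} 0 & 2 \\ 1 & 0 \end{vmatrix} \right\} \\
&= 3\{(1)(-7) - (3)(8) + (-2)(2)\} + (-1)\{(4)(8) - (1)(-3) + (-2)(-2)\} - 2\{(4)(2) - (1)(1) + (3)(-2)\} \\
&= 3(-7 - 24 - 4) + (-1)(32 + 3 + 4) - 2(8 - 1 - 6) \\
&= 3(-35) + (-1)(39) - 2(1) = -146.
\end{aligned}
\]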


In computing the determinant, the expansion can be about the elements of any row 
or column. To get the signs, give the first term a plus sign if the sum of its column 
number and row number is even, give it a minus if the sum is odd, with alternating signs 
thereafter. (For example, in expanding about the elements of the third row we begin with 
a plus; the first element $a_{31}$ has 3 + 1 = 4, an even number.) Judicious selection of rows 
and columns with many zeros can hasten the process, but this method of calculating 
determinants is a lot of work if the matrix is of large size. Methods that triangularize a 
matrix, as described below, are good methods for getting determinants. 
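
The expansion by minors can be written as a short recursive routine. The sketch below is our own illustration (the name `det` and the choice to always expand about the first row are assumptions made for simplicity); it is meant to show the rule, not to be an efficient method for large matrices.

```python
def det(M):
    """Determinant by expansion in minors about the first row."""
    n = len(M)
    if n == 1:
        return M[0][0]
    if n == 2:
        # major-diagonal product minus minor-diagonal product
        return M[0][0] * M[1][1] - M[0][1] * M[1][0]
    total = 0
    for j in range(n):
        if M[0][j] == 0:
            continue  # a zero term contributes nothing, so skip its minor
        # strike out row 0 and column j to form the minor
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        sign = -1 if j % 2 else 1  # signs alternate +, -, +, ... along row 0
        total += sign * M[0][j] * det(minor)
    return total


A = [[3, 0, -1, 2],
     [4, 1, 3, -2],
     [0, 2, -1, 3],
     [1, 0, 1, 4]]
print(det(A))   # -146
```

The work grows roughly like n! with the order n, which is why the triangularization methods are preferred in practice.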


2.3 ELIMINATION METHOD 


The first method we will study for the solution of a set of equations is just an enlargement 
of the familiar method of eliminating one unknown between a pair of simultaneous 
equations. It is generally called Gaussian elimination and is the basic pattern of a large 
number of methods that can be classed as direct methods. (This is to distinguish them 
from indirect, or iterative, methods, which we will discuss later.) 

Consider the simple example of three equations 

\[
\begin{aligned}
3x_1 - x_2 + 2x_3 &= 12, \\
x_1 + 2x_2 + 3x_3 &= 11, \\
2x_1 - 2x_2 - x_3 &= 2.
\end{aligned}
\]

Multiplying the first equation by -1 and the second by 3 and adding will eliminate 
$x_1$. Similarly, multiplying the first by -2 and the third by 3 and adding also eliminates 
$x_1$. (We prefer, in hand calculations, to multiply by the negative values and add, to avoid 
making mistakes when subtracting quantities of unlike sign.) The result is 

\[
\begin{aligned}
3x_1 - x_2 + 2x_3 &= 12, \\
7x_2 + 7x_3 &= 21, \\
-4x_2 - 7x_3 &= -18.
\end{aligned}
\]

We eliminate $x_2$ between the second and third equations by multiplying the second by 4 
and the third by 7 and adding. (Of course, just adding them as they stand would eliminate 
$x_3$, which is equally satisfactory, but we wish to keep our method systematic to lead up 
to an algorithm that can be readily programmed.) After this operation we have the upper- 
triangular system 

\[
\begin{aligned}
3x_1 - x_2 + 2x_3 &= 12, \\
7x_2 + 7x_3 &= 21, \\
-21x_3 &= -42.
\end{aligned}
\]

Obviously $x_3 = 2$ from the third equation, and back-substitution gives $x_2 = 1$ (from the 
second equation), and $x_1 = 3$ (from the first equation). 
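We now present the same problem, solved in exactly the same way, in matrix notation: 

\[
\begin{bmatrix} 3 & -1 & 2 \\ 1 & 2 & 3 \\ 2 & -2 & -1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} =
\begin{bmatrix} 12 \\ 11 \\ 2 \end{bmatrix}.
\]

The arithmetic operations we have performed affect only the coefficients and the 
constant terms, of course, so we work with the matrix of coefficients “augmented” with 
the b-vector: 

\[
\left[\begin{array}{rrr|r} 3 & -1 & 2 & 12 \\ 1 & 2 & 3 & 11 \\ 2 & -2 & -1 & 2 \end{array}\right].
\]

(The dotted line is usually omitted.) 
We perform elementary row transformations* to convert A to upper-triangular: 

\[
\left[\begin{array}{rrrr} 3 & -1 & 2 & 12 \\ 1 & 2 & 3 & 11 \\ 2 & -2 & -1 & 2 \end{array}\right]
\begin{array}{l} \\ 3R_2 + (-1)R_1 \rightarrow \\ 3R_3 + (-2)R_1 \rightarrow \end{array}
\left[\begin{array}{rrrr} 3 & -1 & 2 & 12 \\ 0 & 7 & 7 & 21 \\ 0 & -4 & -7 & -18 \end{array}\right]
\]

\[
\begin{array}{l} \\ \\ 7R_3 + 4R_2 \rightarrow \end{array}
\left[\begin{array}{rrrr} 3 & -1 & 2 & 12 \\ 0 & 7 & 7 & 21 \\ 0 & 0 & -21 & -42 \end{array}\right].
\tag{2.1}
\]

The steps are to add 3 times the second row to -1 times the first row and 3 times the 
third row to -2 times the first row. The next phase adds 7 times the third row to 4 times 
the second row. 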

We are now ready for back-substitution. Note that, except for notation and terminology, 
there is nothing new here. We depend on our memory to know which numbers 
in the converted augmented matrix are coefficients and which are the right-hand sides 
(the constant terms). 


*Elementary row operations are arithmetic operations that are obviously valid rearrangements of a set of 
equations: (1) any equation can be multiplied by a constant; (2) the order of the equations can be changed; (3) 
any equation can be replaced by its sum with another of the equations. 
The back-substitution step can be performed quite mechanically by eliminating the 
coefficients above the diagonal in (2.1). Adding the third row of (2.1) to 3 times the 
second row, and adding twice the third row to 21 times the first row gives 

\[
\left[\begin{array}{rrrr} 63 & -21 & 0 & 168 \\ 0 & 21 & 0 & 21 \\ 0 & 0 & -21 & -42 \end{array}\right].
\]

We finish the elimination of off-diagonal elements by adding the second row to the 
first: 

\[
\left[\begin{array}{rrrr} 63 & 0 & 0 & 189 \\ 0 & 21 & 0 & 21 \\ 0 & 0 & -21 & -42 \end{array}\right].
\]

If we divide each row by the diagonal element, we get a form in which the elements 
of x, the vector whose components are the unknowns $x_i$, $i = 1, 2, \ldots, n$, are equal to 
the components of the transformed b-vector: 

\[
\left[\begin{array}{rrrr} 1 & 0 & 0 & 3 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 2 \end{array}\right], \qquad
x = \begin{bmatrix} 3 \\ 1 \\ 2 \end{bmatrix}, \qquad x_1 = 3, \quad x_2 = 1, \quad x_3 = 2.
\]

Thinking of this procedure in terms of matrix operations, we transform the augmented 
coefficient matrix by elementary row operations until the identity matrix is created on the 
left. The x-vector then stands as the rightmost column. 

Note that there exists the possibility that the set of equations has no solution, or that 
the above procedure will fail to find it. During the triangularization step, if a zero is 
encountered on the diagonal, we cannot use that row to eliminate coefficients below that 
zero element. However, in that case, we can continue by interchanging rows and even- 
tually achieve an upper-triangular matrix of coefficients. The real stumbling block is 
finding a zero on the diagonal after we have triangularized. If that occurs, the back- 
substitution fails for we cannot divide by zero. 

It is worthwhile to explain in more detail what we mean by the elementary row 
operations that we have used above, and to see why they can be used in solving a linear 
system. There are three of these operations: 


1. We may multiply any row of the augmented coefficient matrix by a 
nonzero constant. 

2. We can add a multiple of one row to a multiple of any other row. 

3. We can interchange the order of any two rows (this was not used above). 

The validity of these row operations is intuitively obvious if we think of them applied 
to a set of linear equations. Certainly, multiplying one equation through by a constant 
does not change the truth of the equality. Adding equal quantities to both sides of an 
equality results in an equality, and this is the equivalent of the second transformation. 
Obviously the order of the set is arbitrary, so rule 3 is valid. 

These operations, which do not change the relationships represented by a set of 
equations, can be applied to an augmented matrix, because this is only a different notation 
for the equations. (We need to add one proviso to the above. Since round-off error is 
related to the magnitude of the values when we express them in fixed-word-length com- 
puter representations, some of the above operations may have an effect on the accuracy 
of the computed solution.) 

We should also observe that the “back-substitution” phase, when it is done by making 
the coefficients above the diagonal zero and then reducing the coefficient matrix to the 
identity matrix, is exactly the same as back-substitution in the more explicit sense. The 
order of operations is changed but each of the steps is identical. 
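
The three elementary row operations are easy to express as small routines acting on the augmented matrix. The sketch below is illustrative only; the function names are ours, and the matrix is assumed to be stored as a list of rows with 0-based indices. Applying the operations in the order used in this section reproduces the triangular form (2.1).

```python
def scale_row(M, i, c):
    """Operation 1: multiply row i by a nonzero constant c."""
    if c == 0:
        raise ValueError("the constant must be nonzero")
    M[i] = [c * v for v in M[i]]


def combine_rows(M, i, ci, k, ck):
    """Operation 2: replace row i by ci*(row i) + ck*(row k), with ci nonzero."""
    if ci == 0:
        raise ValueError("the multiple of the row being replaced must be nonzero")
    M[i] = [ci * a + ck * b for a, b in zip(M[i], M[k])]


def swap_rows(M, i, k):
    """Operation 3: interchange rows i and k."""
    M[i], M[k] = M[k], M[i]


aug = [[3, -1, 2, 12],
       [1, 2, 3, 11],
       [2, -2, -1, 2]]
combine_rows(aug, 1, 3, 0, -1)   # 3*R2 + (-1)*R1
combine_rows(aug, 2, 3, 0, -2)   # 3*R3 + (-2)*R1
combine_rows(aug, 2, 7, 1, 4)    # 7*R3 + 4*R2
print(aug)   # [[3, -1, 2, 12], [0, 7, 7, 21], [0, 0, -21, -42]]
```

Because only integer multiples are used, no round-off occurs here; the price is that the entries grow, which is the objection raised at the start of the next section.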


2.4 GAUSS AND GAUSS-JORDAN METHODS 


While the procedure of the previous section is satisfactory for hand calculations on small 
systems, there are several objections that we should eliminate before we write a computer 
program to perform Gaussian elimination. In a large set of equations, and that is the 
situation we must prepare for, the multiplications will give very large and unwieldy 
numbers that may overflow the computer's registers. We will therefore eliminate the first 
coefficient in the ith row by subtracting $a_{i1}/a_{11}$ times the first equation from the ith 
equation. (This is equivalent to making the leading coefficient 1 in the equation that 
retains that leading term.) We use similar ratios of coefficients in eliminating coefficients 
in the other columns. 

We must also guard against dividing by zero. Observe that zeros may be created in 
the diagonal positions even if they are not present in the original matrix of coefficients. 
A useful strategy to avoid (if possible) such zero divisors is to rearrange the equations 
so as to put the coefficient of largest magnitude on the diagonal at each step. This is 
called pivoting. Complete pivoting may require both row and column interchanges. This 
is not frequently done. Partial pivoting, which places a coefficient of larger magnitude 
on the diagonal by row interchanges only, will guarantee a nonzero divisor if there is a 
solution to the set of equations, and will have the added advantage of giving improved 
arithmetic precision. The diagonal elements that result are called pivot elements. (When 
there are large differences in magnitude of coefficients in one equation compared to the 
other equations, we may need to scale the values; we consider this later.) 

We repeat the example of the previous section, incorporating these ideas and carrying 
four significant digits in our work. We begin with the augmented matrix. 


Ne 
| 


A 
Row 2— (j)Rowl1—>]0 2.333 2.334 7.004 
Row 3 — (Row 1— [0 1.334 —2.3 
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3-1 2 12 
a4 0 2.333 2.334 7.004 
Row 3 ~ ( Row 2>[0 0 ~~ ~1.000  ~1.993 


The method we have just illustrated is called Gaussian elimination. (In this example, 
no pivoting was required to make the largest coefficients be on the diagonal.) Back- 
substitution, beginning with the third equation and then moving “backward” to the second 
and first equations, gives $x_3 = 1.993$, $x_2 = 1.008$, $x_1 = 3.007$. The differences of these 
values from 2, 1, and 3 are due to the effects of round-off error. In this example, we 
have truncated after the fourth digit rather than rounding to the nearest fourth digit, which 
is the same as the arithmetic performed in many computers. In either case, the errors 
similarly affect the accuracy of the results. When there are many equations, the effects 
of round-off (the term is applied to the error due to chopping as well as when rounding 
is used) may cause large effects. In certain cases, the coefficients are such that the results 
are particularly sensitive to round-off; such systems are called ill-conditioned. 

In the example just presented, the zeros below the main diagonal show that we have 
reduced the problem to solving an upper-triangular system of equations as in Section 2.3. 
However, at each stage, if we had stored the ratio of coefficients in place of zero, our 
final form would have been 


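\[
\left[\begin{array}{rrrr}
3 & -1 & 2 & 12 \\
(0.3333) & 2.333 & 2.334 & 7.004 \\
(0.6667) & (-0.5711) & -1.000 & -1.993
\end{array}\right].
\]

Then in addition to solving the problem as we have done, one finds that the original 
matrix 

\[
A = \begin{bmatrix} 3 & -1 & 2 \\ 1 & 2 & 3 \\ 2 & -2 & -1 \end{bmatrix}
\]

can be written as the product 

\[
\begin{bmatrix} 1 & 0 & 0 \\ 0.3333 & 1 & 0 \\ 0.6667 & -0.5711 & 1 \end{bmatrix}
\begin{bmatrix} 3 & -1 & 2 \\ 0 & 2.333 & 2.334 \\ 0 & 0 & -1.000 \end{bmatrix}
\]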

to four decimal places. This is called an LU decomposition of A. In Section 2.5, other 
methods for finding the LU decomposition of a matrix will be presented. If a matrix can 
be written in the LU form such as that presented here, the number of computations needed 
in solving the system of equations is reduced significantly. 
Let us summarize the operations of Gaussian elimination in a form that will facilitate 
writing a computer program. We use a less formal method of stating this algorithm. 


Gaussian Elimination 
To solve a system of linear equations, 

1. Augment the $n \times n$ coefficient matrix with the vector of right-hand 
sides to form an $n \times (n + 1)$ matrix. 

2. Interchange rows if necessary to make the value of $a_{11}$ the largest 
magnitude of any coefficient in the first column. 

3. Create zeros in the second through nth rows in the first column by 
subtracting $a_{i1}/a_{11}$ times the first row from the ith row. Store the $a_{i1}/a_{11}$ 
in $a_{i1}$, $i = 2, 3, \ldots, n$. 

4. Repeat steps 2 and 3 for the second through the (n - 1)st rows, putting 
the largest-magnitude coefficient on the diagonal by interchanging rows 
(considering only rows j to n), and then subtracting $a_{ij}/a_{jj}$ times the 
jth row from the ith row so as to create zeros in all positions of the jth 
column below the diagonal. Store the $a_{ij}/a_{jj}$ in $a_{ij}$, $i = j + 1, \ldots, n$. 
At the conclusion of this step, the system is upper-triangular. 

5. Solve for $x_n$ from the nth equation by 

\[
x_n = a_{n,n+1}/a_{nn}.
\]

6. Solve for $x_{n-1}, x_{n-2}, \ldots, x_1$ from the (n - 1)st through the first 
equation in turn, by 

\[
x_i = \frac{a_{i,n+1} - \sum_{j=i+1}^{n} a_{ij}x_j}{a_{ii}}.
\]


Some computer programs do not actually interchange all the elements of the rows 
when pivoting. In these programs, one keeps track of the order in which the rows are to 
be used in a vector whose elements represent row order. When an interchange is indicated, 
only the elements of this ordering vector are changed. These numbers are then used to 
locate the positions of the elements in the matrix of coefficients that are to be operated 
on, both during the reduction step and during the back-substitution. This can reduce the 
computer time for large systems, but adds to the complexity of the program. 
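
The six steps of the algorithm translate almost line for line into a program. The following Python sketch is one possible rendering, not the program given with the text: the name `gauss_solve` is ours, rows are physically interchanged (no ordering vector is used), indices are 0-based, and ordinary double-precision arithmetic is assumed.

```python
def gauss_solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Step 1: augment the coefficient matrix with the right-hand side.
    aug = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]

    for j in range(n - 1):
        # Steps 2 and 4: bring the largest-magnitude entry of column j
        # (rows j through n-1) onto the diagonal.
        p = max(range(j, n), key=lambda r: abs(aug[r][j]))
        if aug[p][j] == 0.0:
            raise ValueError("no nonzero pivot: the matrix is singular")
        aug[j], aug[p] = aug[p], aug[j]
        # Steps 3 and 4: create zeros below the diagonal in column j.
        for i in range(j + 1, n):
            ratio = aug[i][j] / aug[j][j]
            for k in range(j + 1, n + 1):
                aug[i][k] -= ratio * aug[j][k]
            aug[i][j] = ratio   # store the multiplier where the zero would go

    if aug[n - 1][n - 1] == 0.0:
        raise ValueError("zero on the diagonal after triangularization")

    # Steps 5 and 6: back-substitution, last equation first.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


A = [[0, 2, 0, 1], [2, 2, 3, 2], [4, -3, 0, 1], [6, 1, -6, -5]]
b = [0, -2, -7, 6]
x = gauss_solve(A, b)
print(x)   # close to [-0.5, 1.0, 0.3333, -2.0]
```

The stored multipliers are not needed for the solution itself; they are kept only because they form the lower-triangular factor discussed earlier in this section.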

The algorithm for Gaussian elimination will be clarified by an additional numerical 
example. 
Solve 

\[
\begin{aligned}
2x_2 + x_4 &= 0, \\
2x_1 + 2x_2 + 3x_3 + 2x_4 &= -2, \\
4x_1 - 3x_2 + x_4 &= -7, \\
6x_1 + x_2 - 6x_3 - 5x_4 &= 6.
\end{aligned}
\]

The augmented coefficient matrix is 

\[
\left[\begin{array}{rrrrr}
0 & 2 & 0 & 1 & 0 \\
2 & 2 & 3 & 2 & -2 \\
4 & -3 & 0 & 1 & -7 \\
6 & 1 & -6 & -5 & 6
\end{array}\right].
\]

We cannot permit a zero in the $a_{11}$ position because that element is the pivot in reducing 
the first column. We could interchange the first row with any of the other rows to avoid 
a zero divisor, but interchanging the first and fourth rows is our best choice. This gives 

\[
\left[\begin{array}{rrrrr}
6 & 1 & -6 & -5 & 6 \\
2 & 2 & 3 & 2 & -2 \\
4 & -3 & 0 & 1 & -7 \\
0 & 2 & 0 & 1 & 0
\end{array}\right].
\]

We make all the elements in the first column zero by subtracting the appropriate multiple 
of row one: 

\[
\left[\begin{array}{rrrrr}
6 & 1 & -6 & -5 & 6 \\
0 & 1.6667 & 5 & 3.6667 & -4 \\
0 & -3.6667 & 4 & 4.3333 & -11 \\
0 & 2 & 0 & 1 & 0
\end{array}\right].
\]

We again interchange before reducing the second column, not because we have a zero 
divisor, but because we want to preserve accuracy.* Interchanging the second and third 
rows puts the element of largest magnitude on the diagonal. (We could also interchange 
the fourth column with the second, giving an even larger diagonal element, but we do 
not do this.) After the interchange, we have 

\[
\left[\begin{array}{rrrrr}
6 & 1 & -6 & -5 & 6 \\
0 & -3.6667 & 4 & 4.3333 & -11 \\
0 & 1.6667 & 5 & 3.6667 & -4 \\
0 & 2 & 0 & 1 & 0
\end{array}\right].
\]

Now we reduce in the second column: 

\[
\left[\begin{array}{rrrrr}
6 & 1 & -6 & -5 & 6 \\
0 & -3.6667 & 4 & 4.3333 & -11 \\
0 & 0 & 6.8182 & 5.6364 & -9.0001 \\
0 & 0 & 2.1818 & 3.3636 & -5.9999
\end{array}\right].
\]


*A numerical example that demonstrates the improved accuracy when partial pivoting is used will be found in 
Section 2.9. 
No interchange is indicated in the third column. Reducing, we get 

\[
\left[\begin{array}{rrrrr}
6 & 1 & -6 & -5 & 6 \\
0 & -3.6667 & 4 & 4.3333 & -11 \\
0 & 0 & 6.8182 & 5.6364 & -9.0001 \\
0 & 0 & 0 & 1.5600 & -3.1199
\end{array}\right].
\]

Back-substitution gives 

\[
\begin{aligned}
x_4 &= \frac{-3.1199}{1.5600} = -1.9999, \\
x_3 &= \frac{-9.0001 - 5.6364(-1.9999)}{6.8182} = 0.33325, \\
x_2 &= \frac{-11 - 4.3333(-1.9999) - 4(0.33325)}{-3.6667} = 1.0000, \\
x_1 &= \frac{6 - (-5)(-1.9999) - (-6)(0.33325) - (1)(1.0000)}{6} = -0.50000.
\end{aligned}
\]

The correct answers are $-2$, $\tfrac{1}{3}$, $1$, and $-\tfrac{1}{2}$ for $x_4$, $x_3$, $x_2$, and $x_1$. In this calculation we 
have carried five significant figures and rounded each calculation. Even so, we do not 
have five-digit accuracy in the answers. The discrepancy is due to round-off. The question 
of the accuracy of the computed solution to a set of equations is a most important one, 
and at several points in the following discussion we will discuss how to minimize the 
effects of round-off and avoid conditions that can cause round-off errors to be magnified. 

In this example, if one had replaced the zeros below the main diagonal with the ratio 
of coefficients at each step, the resulting augmented matrix would be 

\[
\left[\begin{array}{rrrrr}
6 & 1 & -6 & -5 & 6 \\
(0.66667) & -3.6667 & 4 & 4.3333 & -11 \\
(0.33333) & (-0.45454) & 6.8182 & 5.6364 & -9.0001 \\
(0.0) & (-0.54545) & (0.32) & 1.5600 & -3.1199
\end{array}\right].
\]

This gives the LU decomposition as 

\[
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0.66667 & 1 & 0 & 0 \\
0.33333 & -0.45454 & 1 & 0 \\
0 & -0.54545 & 0.32 & 1
\end{bmatrix}
\begin{bmatrix}
6 & 1 & -6 & -5 \\
0 & -3.6667 & 4 & 4.3333 \\
0 & 0 & 6.8182 & 5.6364 \\
0 & 0 & 0 & 1.5600
\end{bmatrix}.
\]

You should check that the product of these matrices is indeed the original matrix except 
that the rows are interchanged owing to pivoting. The next section explores the usefulness 
of LU formulations more fully. 
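
The suggested check is quickly done by machine. In the sketch below (our own illustration; `mat_mul` is a hypothetical helper and the matrices are typed in to the five digits carried in the example), the product LU is compared with the original coefficient matrix whose rows have been rearranged in the order that pivoting selected them: fourth, third, second, first.

```python
def mat_mul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


L = [[1, 0, 0, 0],
     [0.66667, 1, 0, 0],
     [0.33333, -0.45454, 1, 0],
     [0, -0.54545, 0.32, 1]]
U = [[6, 1, -6, -5],
     [0, -3.6667, 4, 4.3333],
     [0, 0, 6.8182, 5.6364],
     [0, 0, 0, 1.5600]]
# Rows of the original A in the order used after the interchanges.
PA = [[6, 1, -6, -5],
      [4, -3, 0, 1],
      [2, 2, 3, 2],
      [0, 2, 0, 1]]

LU = mat_mul(L, U)
worst = max(abs(LU[i][j] - PA[i][j]) for i in range(4) for j in range(4))
print(worst)   # small: agreement to about four decimal places
```

The small discrepancies that remain come only from the five-digit rounding of the factors.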

There are many variants to the Gaussian elimination scheme. The back-substitution 
step can be performed by eliminating the above-diagonal elements, using elementary row 
operations and proceeding upward from the last row, after the triangularization has been 
finished. This is similar to an example presented in the previous section. The diagonal 
elements may all be made ones as a first step before creating zeros in their column: this 
does the divisions of the back-substitution phase at an earlier time. 

One variant that is sometimes used is the Gauss—Jordan scheme. In it, the elements 
above the diagonal are made zero at the same time that zeros are created below the 
diagonal. Usually the diagonal elements are made ones at the same time that the reduction 
is performed; this transforms the coefficient matrix into the identity matrix. When this 
has been accomplished, the column of right-hand sides has been transformed into the 
solution vector. Pivoting is normally employed to preserve arithmetic accuracy. 

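The previous example, solved by the Gauss-Jordan method, gives this succession 
of calculations. The original augmented matrix is 

\[
\left[\begin{array}{rrrrr}
0 & 2 & 0 & 1 & 0 \\
2 & 2 & 3 & 2 & -2 \\
4 & -3 & 0 & 1 & -7 \\
6 & 1 & -6 & -5 & 6
\end{array}\right].
\]

Interchanging rows one and four, dividing the first row by 6, and reducing the first column 
gives 

\[
\left[\begin{array}{rrrrr}
1 & 0.16667 & -1 & -0.83335 & 1 \\
0 & 1.6667 & 5 & 3.6667 & -4 \\
0 & -3.6667 & 4 & 4.3334 & -11 \\
0 & 2 & 0 & 1 & 0
\end{array}\right].
\]

Dividing the second row by 1.6667 and reducing the second column (operating above the 
diagonal as well as below) gives 

\[
\left[\begin{array}{rrrrr}
1 & 0 & -1.5000 & -1.2000 & 1.4000 \\
0 & 1 & 2.9999 & 2.2000 & -2.4000 \\
0 & 0 & 15.000 & 12.400 & -19.800 \\
0 & 0 & -5.9998 & -3.4000 & 4.8000
\end{array}\right].
\]

(The values shown correspond to using 1.6667 as the pivot. Partial pivoting would instead 
interchange rows two and three to bring -3.6667 onto the diagonal.) 

No interchanges are required for the next step. We divide the third row by 15.000 and 
make the other elements in the third column into zeros: 

\[
\left[\begin{array}{rrrrr}
1 & 0 & 0 & 0.04000 & -0.58000 \\
0 & 1 & 0 & -0.27993 & 1.5599 \\
0 & 0 & 1 & 0.82667 & -1.3200 \\
0 & 0 & 0 & 1.5599 & -3.1197
\end{array}\right].
\]

We now divide the fourth row by 1.5599 and create zeros above the diagonal in the 
fourth column: 

\[
\left[\begin{array}{rrrrr}
1 & 0 & 0 & 0 & -0.49999 \\
0 & 1 & 0 & 0 & 1.0001 \\
0 & 0 & 1 & 0 & 0.33326 \\
0 & 0 & 0 & 1 & -1.9999
\end{array}\right].
\]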


The solution is essentially the same as with the usual Gaussian method; round-off errors 
have created inaccuracies in a slightly different way than did the previous computation. 

While the above Gauss—Jordan scheme appears to be a duplicate of the work done 
in the standard procedure, a count of the arithmetic operations shows that the Gauss— 
Jordan method requires almost 50% more operations. We therefore do not recommend 
its use. 
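
For comparison, a Gauss-Jordan reduction can be sketched as follows. This is an illustrative Python version under our own assumptions (the name `gauss_jordan_solve`, 0-based indices, partial pivoting at every column, double-precision arithmetic), so its intermediate numbers will not match the five-digit hand computation above.

```python
def gauss_jordan_solve(A, b):
    """Solve Ax = b by Gauss-Jordan reduction with partial pivoting."""
    n = len(A)
    aug = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for j in range(n):
        # choose the largest-magnitude pivot in column j, rows j..n-1
        p = max(range(j, n), key=lambda r: abs(aug[r][j]))
        if aug[p][j] == 0.0:
            raise ValueError("no nonzero pivot: the matrix is singular")
        aug[j], aug[p] = aug[p], aug[j]
        # make the diagonal element one
        pivot = aug[j][j]
        aug[j] = [v / pivot for v in aug[j]]
        # create zeros above as well as below the diagonal
        for i in range(n):
            if i != j:
                factor = aug[i][j]
                aug[i] = [vi - factor * vj for vi, vj in zip(aug[i], aug[j])]
    # the coefficient part is now the identity; the last column is x
    return [row[n] for row in aug]


A = [[0, 2, 0, 1], [2, 2, 3, 2], [4, -3, 0, 1], [6, 1, -6, -5]]
b = [0, -2, -7, 6]
x = gauss_jordan_solve(A, b)
print(x)   # close to [-0.5, 1.0, 0.3333, -2.0]
```

The inner loop runs over all n rows at every stage rather than only the rows below the diagonal, which is where the extra arithmetic of the Gauss-Jordan scheme comes from.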

A frequently occurring situation is to have to solve a set of equations with the same 
coefficient matrix but with a number of different right-hand sides. For example, in the 
case of interconnected resistances at the beginning of this chapter, we may desire to study 
the effects of different voltages applied at points 1 and 6. In the design of a truss (see 
Problems 62 and 64), one usually wishes to determine the stresses under a variety of 
external loads—this causes only the right-hand-side terms to vary. 

If all of the different right-hand-side vectors are known in advance, the multiple 
solutions of the system can be obtained simultaneously using our Gaussian elimination 
method. One augments the coefficient matrix with all of the right-hand-side vectors, and 
treats each augmentation column in the same way as a single added column. In the back- 
substitution phase, each of the columns is employed to give a solution vector. At the end 
of this chapter, a subroutine is exhibited that does this. 
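
The idea of carrying several right-hand sides through one elimination can be sketched as below. This is not the subroutine referred to above; it is a minimal Python illustration in which the name `gauss_solve_many` is ours. The test system is the three-equation example of Section 2.3 together with a second right-hand side, [4, 6, -1], which we have made up so that its solution is [1, 1, 1].

```python
def gauss_solve_many(A, rhs_list):
    """Solve Ax = b for every b in rhs_list with a single triangularization."""
    n, m = len(A), len(rhs_list)
    # augment A with all of the right-hand-side vectors as extra columns
    aug = [[float(v) for v in A[i]] + [float(b[i]) for b in rhs_list]
           for i in range(n)]
    for j in range(n - 1):
        p = max(range(j, n), key=lambda r: abs(aug[r][j]))
        if aug[p][j] == 0.0:
            raise ValueError("no nonzero pivot: the matrix is singular")
        aug[j], aug[p] = aug[p], aug[j]
        for i in range(j + 1, n):
            ratio = aug[i][j] / aug[j][j]
            # every augmentation column is treated like a single added column
            for k in range(j, n + m):
                aug[i][k] -= ratio * aug[j][k]
    if aug[n - 1][n - 1] == 0.0:
        raise ValueError("zero on the diagonal after triangularization")
    solutions = []
    for c in range(m):
        # back-substitute using transformed column c
        x = [0.0] * n
        for i in range(n - 1, -1, -1):
            s = sum(aug[i][k] * x[k] for k in range(i + 1, n))
            x[i] = (aug[i][n + c] - s) / aug[i][i]
        solutions.append(x)
    return solutions


A = [[3, -1, 2], [1, 2, 3], [2, -2, -1]]
xs = gauss_solve_many(A, [[12, 11, 2], [4, 6, -1]])
print(xs)   # close to [[3.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
```

The reduction of A is done once; only the back-substitution is repeated for each column.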


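EXAMPLE  Solve the system Ax = b, with multiple values of b, by Gaussian elimination: 

\[
A = \begin{bmatrix} 3 & 2 & -1 & 2 \\ 1 & 4 & 0 & 2 \\ 2 & 1 & 2 & -1 \\ 1 & 1 & -1 & 3 \end{bmatrix}, \qquad
b^{(1)} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad
b^{(2)} = \begin{bmatrix} -2 \\ 1 \\ 3 \\ 4 \end{bmatrix}, \quad
b^{(3)} = \begin{bmatrix} 2 \\ 2 \\ 0 \\ 0 \end{bmatrix}.
\]

We augment A with all of the b's, then triangularize: 

\[
\left[\begin{array}{rrrr|rrr}
3 & 2 & -1 & 2 & 0 & -2 & 2 \\
1 & 4 & 0 & 2 & 0 & 1 & 2 \\
2 & 1 & 2 & -1 & 1 & 3 & 0 \\
1 & 1 & -1 & 3 & 0 & 4 & 0
\end{array}\right]
\]

\[
\left[\begin{array}{rrrr|rrr}
3 & 2 & -1 & 2 & 0 & -2 & 2 \\
(0.333) & 3.333 & 0.333 & 1.333 & 0 & 1.667 & 1.333 \\
(0.667) & -0.333 & 2.667 & -2.333 & 1 & 4.333 & -1.333 \\
(0.333) & 0.333 & -0.667 & 2.333 & 0 & 4.667 & -0.667
\end{array}\right]
\]

\[
\left[\begin{array}{rrrr|rrr}
3 & 2 & -1 & 2 & 0 & -2 & 2 \\
(0.333) & 3.333 & 0.333 & 1.333 & 0 & 1.667 & 1.333 \\
(0.667) & (-0.100) & 2.700 & -2.200 & 1 & 4.500 & -1.200 \\
(0.333) & (0.100) & -0.700 & 2.200 & 0 & 4.500 & -0.800
\end{array}\right]
\]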


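\[
\left[\begin{array}{rrrr|rrr}
3 & 2 & -1 & 2 & 0 & -2 & 2 \\
(0.333) & 3.333 & 0.333 & 1.333 & 0 & 1.667 & 1.333 \\
(0.667) & (-0.100) & 2.700 & -2.200 & 1 & 4.500 & -1.200 \\
(0.333) & (0.100) & (-0.259) & 1.630 & 0.259 & 5.667 & -1.111
\end{array}\right]
\qquad \text{(last three columns: } c^{(1)},\ c^{(2)},\ c^{(3)}\text{)}
\]

We obtain the three solution vectors by back-substitution, employing the proper b' vector. 
(These are indicated above as $c^{(i)}$.) 

\[
x^{(1)} = \begin{bmatrix} 0.137 \\ -0.114 \\ 0.500 \\ 0.159 \end{bmatrix}, \qquad
x^{(2)} = \begin{bmatrix} -0.591 \\ -1.340 \\ 4.500 \\ 3.477 \end{bmatrix}, \qquad
x^{(3)} = \begin{bmatrix} 0.273 \\ 0.773 \\ -1.000 \\ -0.682 \end{bmatrix}.
\]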

SCALING 


We have mentioned that the rows of the augmented coefficient matrix may need to be 
scaled before a proper choice of pivot element can be made. Scaling is the operation of 
adjusting the coefficients of a set of equations so that they are all of the same order of 
magnitude. In some instances, a set of equations may involve relationships between 
quantities measured in widely different units (microvolts versus kilovolts, for example, 
or nanoseconds versus years). This may result in some of the equations having very large 
numbers and others very small. If we select the pivot elements without scaling, pivoting 
may put numbers on the diagonal that are not large in comparison to others in their row; 
this can actually create the round-off errors that pivoting was supposed to avoid. An 
example will clarify the concept. 

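\[
\begin{bmatrix} 3 & 2 & 100 \\ -1 & 3 & 100 \\ 1 & 2 & -1 \end{bmatrix} x =
\begin{bmatrix} 105 \\ 102 \\ 2 \end{bmatrix}.
\]

Carrying only three digits to emphasize round-off, and using partial pivoting, we find 
that the triangularized system is 

\[
\left[\begin{array}{rrrr}
3 & 2 & 100 & 105 \\
0 & 3.66 & 133 & 137 \\
0 & 0 & -82.6 & -82.7
\end{array}\right],
\]

from which $x_3 = 1.00$, $x_2 = 1.09$, $x_1 = 0.94$; the exact solution vector should be 
[1.00, 1.00, 1.00]. 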


106 CHAPTER 2: SOLVING SETS OF EQUATIONS 




If we scale the values before reduction by dividing each row by the magnitude of 
the largest coefficient, so that the system is 

\[
\begin{bmatrix} 0.03 & 0.02 & 1.00 \\ -0.01 & 0.03 & 1.00 \\ 0.50 & 1.00 & -0.50 \end{bmatrix} x =
\begin{bmatrix} 1.05 \\ 1.02 \\ 1.00 \end{bmatrix},
\]

we get, with the same arithmetic precision, the triangularized system 

\[
\left[\begin{array}{rrrr}
0.50 & 1.00 & -0.50 & 1.00 \\
0 & 0.05 & 0.99 & 1.04 \\
0 & 0 & 1.82 & 1.82
\end{array}\right],
\]

from which $x_3 = 1.00$, $x_2 = 1.00$, $x_1 = 1.00$. The reason for the improvement is that 
rows are interchanged after scaling has been done. No interchanges are indicated in the 
unscaled equations. 

Whenever the coefficients in one column are widely different from those in another 
column, scaling is beneficial. When all values are about the same order of magnitude, 
scaling should be avoided, for the additional round-off error incurred during the scaling 
operation itself may adversely affect the accuracy. The usual way to scale is as we have 
done here, by dividing each row by the magnitude of the largest term. Some authorities 
recommend scaling so that the sum of the magnitudes of the coefficients in each row is 
the same. This is probably very slightly more economical to do in a computer program. 
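
How scaling changes the choice of pivot can be seen with a few lines of Python. The sketch below is ours (the names `scale_rows` and `pivot_row` are hypothetical, and rows are indexed from 0). It scales each row of the augmented matrix of the example by its largest coefficient magnitude, leaving the right-hand side out of the comparison, and then reports which row partial pivoting would select for the first column.

```python
def scale_rows(aug):
    """Divide each row by the magnitude of its largest coefficient (RHS excluded)."""
    scaled = []
    for row in aug:
        big = max(abs(v) for v in row[:-1])
        if big == 0:
            raise ValueError("a row of zero coefficients cannot be scaled")
        scaled.append([v / big for v in row])
    return scaled


def pivot_row(aug, col):
    """Index of the row (col or below) holding the largest magnitude in column col."""
    return max(range(col, len(aug)), key=lambda r: abs(aug[r][col]))


aug = [[3, 2, 100, 105],
       [-1, 3, 100, 102],
       [1, 2, -1, 2]]

print(pivot_row(aug, 0))               # 0: no interchange without scaling
print(pivot_row(scale_rows(aug), 0))   # 2: the third row becomes the pivot row
```

Without scaling the first row is kept as the pivot row; after scaling the third row is chosen, which is the interchange that improved the three-digit result.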


2.5 OTHER LU METHODS 


With a modification of the elimination method in Section 2.4, we have other LU 
decomposition methods. The best known is named Crout reduction, or, after another 
discoverer, Cholesky reduction. In this method the matrix of coefficients A is transformed 
into the product of two matrices L and U, where L is a lower-triangular and U is an 
upper-triangular matrix with ones on its main diagonal. An equivalent method transforms 
A into an LU pair in which L has ones on its diagonal. This is called Doolittle's method. 
In this section we will concentrate on the Crout reduction method and refer to it simply 
as the LU decomposition method. 

We have previously seen that a matrix that has been triangularized combined with 
the lower-triangular matrix formed from the ratios used in its reduction form an LU pair. 
But LU pairs take many other forms. In fact, any matrix that has all diagonal elements 
nonzero can be written as a product of a lower-triangular and an upper-triangular matrix 
in an infinity of ways. For example, 


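\[
\begin{aligned}
\begin{bmatrix} 2 & -1 & -1 \\ 0 & -4 & 2 \\ 6 & -3 & 1 \end{bmatrix}
&= \underbrace{\begin{bmatrix} 2 & 0 & 0 \\ 0 & -4 & 0 \\ 6 & 0 & 4 \end{bmatrix}}_{L_1}
\underbrace{\begin{bmatrix} 1 & -\tfrac{1}{2} & -\tfrac{1}{2} \\ 0 & 1 & -\tfrac{1}{2} \\ 0 & 0 & 1 \end{bmatrix}}_{U_1} \\
&= \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 3 & 0 & 1 \end{bmatrix}}_{L_2}
\underbrace{\begin{bmatrix} 2 & -1 & -1 \\ 0 & -4 & 2 \\ 0 & 0 & 4 \end{bmatrix}}_{U_2} \\
&= \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 3 & 0 & 1 \end{bmatrix}}_{L_3}
\underbrace{\begin{bmatrix} 2 & -1 & -1 \\ 0 & -2 & 1 \\ 0 & 0 & 4 \end{bmatrix}}_{U_3} \\
&= \text{and so on.}
\end{aligned}
\]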


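Of the entire set of LUs whose product equals matrix A, we choose the pair in which 
U has only ones on its diagonal, as in the first pair above. We get the rules for such an 
LU decomposition from the relationship that LU = A. In the case of a $4 \times 4$ matrix: 

\[
\begin{bmatrix}
\ell_{11} & 0 & 0 & 0 \\
\ell_{21} & \ell_{22} & 0 & 0 \\
\ell_{31} & \ell_{32} & \ell_{33} & 0 \\
\ell_{41} & \ell_{42} & \ell_{43} & \ell_{44}
\end{bmatrix}
\begin{bmatrix}
1 & u_{12} & u_{13} & u_{14} \\
0 & 1 & u_{23} & u_{24} \\
0 & 0 & 1 & u_{34} \\
0 & 0 & 0 & 1
\end{bmatrix}
=
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{bmatrix}.
\]

Multiplying the rows of L by the first column of U, we get $\ell_{11} = a_{11}$, $\ell_{21} = a_{21}$, 
$\ell_{31} = a_{31}$, $\ell_{41} = a_{41}$; the first column of L is the same as the first column of A. 
We now multiply the first row of L by the columns of U: 

\[
\ell_{11}u_{12} = a_{12}, \qquad \ell_{11}u_{13} = a_{13}, \qquad \ell_{11}u_{14} = a_{14},
\tag{2.2}
\]

from which 

\[
u_{12} = \frac{a_{12}}{\ell_{11}}, \qquad u_{13} = \frac{a_{13}}{\ell_{11}}, \qquad u_{14} = \frac{a_{14}}{\ell_{11}}.
\tag{2.3}
\]

Thus the first row of U is determined. 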

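In this method we alternate between getting a column of L and a row of U, so we 
next get the equations for the second column of L by multiplying the rows of L by the 
second column of U: 

\[
\begin{aligned}
\ell_{21}u_{12} + \ell_{22} &= a_{22}, \\
\ell_{31}u_{12} + \ell_{32} &= a_{32}, \\
\ell_{41}u_{12} + \ell_{42} &= a_{42},
\end{aligned}
\tag{2.4}
\]

which gives 

\[
\begin{aligned}
\ell_{22} &= a_{22} - \ell_{21}u_{12}, \\
\ell_{32} &= a_{32} - \ell_{31}u_{12}, \\
\ell_{42} &= a_{42} - \ell_{41}u_{12}.
\end{aligned}
\tag{2.5}
\]

Proceeding in the same fashion, the equations we need are 

\[
\begin{aligned}
u_{23} &= \frac{a_{23} - \ell_{21}u_{13}}{\ell_{22}}, &
u_{24} &= \frac{a_{24} - \ell_{21}u_{14}}{\ell_{22}}, \\
\ell_{33} &= a_{33} - \ell_{31}u_{13} - \ell_{32}u_{23}, &
\ell_{43} &= a_{43} - \ell_{41}u_{13} - \ell_{42}u_{23}, \\
u_{34} &= \frac{a_{34} - \ell_{31}u_{14} - \ell_{32}u_{24}}{\ell_{33}}, \\
\ell_{44} &= a_{44} - \ell_{41}u_{14} - \ell_{42}u_{24} - \ell_{43}u_{34}.
\end{aligned}
\tag{2.6}
\]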


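The general formula for getting elements of L and U corresponding to the coefficient 
matrix for n simultaneous equations can be written 

\[
\begin{aligned}
\ell_{ij} &= a_{ij} - \sum_{k=1}^{j-1} \ell_{ik}u_{kj}, & j \le i, \quad i &= 1, 2, \ldots, n, \\
u_{ij} &= \frac{a_{ij} - \sum_{k=1}^{i-1} \ell_{ik}u_{kj}}{\ell_{ii}}, & i < j, \quad j &= 2, 3, \ldots, n.
\end{aligned}
\tag{2.7}
\]

(For j = 1, the rule for $\ell$ reduces to 

\[
\ell_{i1} = a_{i1}.
\]

For i = 1, the rule for u reduces to 

\[
u_{1j} = \frac{a_{1j}}{\ell_{11}} = \frac{a_{1j}}{a_{11}}.)
\]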

The reason this method is popular in programs is that storage space may be 
economized. There is no need to store the zeros in either L or U, and the ones on the diagonal 
of U can also be omitted. (Since these values are always the same and are always known, 
it is redundant to record them.) One can then store the essential elements of U where the 
zeros appear in the L array. Examination of Eqs. (2.2) through (2.7) shows that, after 
any element of A, $a_{ij}$, is once used, it never again appears in the equations. Hence its 
place in the original $n \times n$ array A can be used to store an element of either L or U. In 
other words, the A array can be transformed by the above equations and becomes 

\[
\begin{bmatrix}
\ell_{11} & u_{12} & u_{13} & \cdots & u_{1n} \\
\ell_{21} & \ell_{22} & u_{23} & \cdots & u_{2n} \\
\ell_{31} & \ell_{32} & \ell_{33} & \cdots & u_{3n} \\
\vdots & & & \ddots & \vdots \\
\ell_{n1} & \ell_{n2} & \ell_{n3} & \cdots & \ell_{nn}
\end{bmatrix}.
\]

Because we can condense the L and U matrices into one array and store their elements 
in the space of A, this method is often called a compact scheme. 


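EXAMPLE  Consider the matrix 

\[
A = \begin{bmatrix} 3 & -1 & 2 \\ 1 & 2 & 3 \\ 2 & -2 & -1 \end{bmatrix}.
\]

Applying the equations for the $\ell$'s and u's, we obtain 

\[
\begin{aligned}
&\ell_{11} = 3, \quad \ell_{21} = 1, \quad \ell_{31} = 2, \quad u_{12} = -\tfrac{1}{3}, \quad u_{13} = \tfrac{2}{3}, \\
&\ell_{22} = 2 - (1)\left(-\tfrac{1}{3}\right) = \tfrac{7}{3}, \qquad \ell_{32} = -2 - (2)\left(-\tfrac{1}{3}\right) = -\tfrac{4}{3}, \\
&u_{23} = \frac{3 - (1)\left(\tfrac{2}{3}\right)}{\tfrac{7}{3}} = 1, \qquad
\ell_{33} = -1 - (2)\left(\tfrac{2}{3}\right) - \left(-\tfrac{4}{3}\right)(1) = -1,
\end{aligned}
\]

\[
L = \begin{bmatrix} 3 & 0 & 0 \\ 1 & \tfrac{7}{3} & 0 \\ 2 & -\tfrac{4}{3} & -1 \end{bmatrix}, \qquad
U = \begin{bmatrix} 1 & -\tfrac{1}{3} & \tfrac{2}{3} \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}.
\]

If the quantities are written in the compact form as they are computed, we have 

\[
\begin{array}{c}
\left[\begin{array}{rrr} 3 & -\tfrac{1}{3} & \tfrac{2}{3} \\ 1 & \tfrac{7}{3} & 1 \\ 2 & -\tfrac{4}{3} & -1 \end{array}\right]
\begin{array}{l} \leftarrow \text{②} \\ \leftarrow \text{④} \\ \mbox{} \end{array} \\
\begin{array}{ccc} \uparrow & \uparrow & \uparrow \\ \text{①} & \text{③} & \text{⑤} \end{array}
\end{array}
\]

The circled numbers show the order in which columns and rows of the new matrix 
are obtained. 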


Here is an algorithm for LU decomposition. It does not compute the L and U matrices 
in place, but sets them up as separate matrices. 

DO FOR I = 1 to n step 1, 
    L(I,1) = A(I,1). 
ENDDO. 
U(1,1) = 1. 
DO FOR I = 2 to n step 1, 
    U(1,I) = A(1,I)/L(1,1). 
ENDDO. 
DO FOR J = 2 to n step 1, 
    DO FOR I = J to n step 1, 
        DO FOR K = 1 to J - 1 step 1, 
            Accumulate sum of L(I,K) * U(K,J), in double precision. 
        ENDDO. 
        L(I,J) = A(I,J) - SUM. 
    ENDDO. 
    U(J,J) = 1. 
    DO FOR I = J + 1 to n step 1, 
        DO FOR K = 1 to J - 1 step 1, 
            Accumulate sum of L(J,K) * U(K,I), in double precision. 
        ENDDO. 
        U(J,I) = [A(J,I) - SUM]/L(J,J). 
    ENDDO. 
ENDDO. 


The solution of the set of equations Ax = b is readily obtained with the L and U 
matrices. Once the coefficient matrix has been converted to its LU equivalent, we are 
prepared to find the solution to the set of equations that corresponds to any given right- 
hand-side vector b. The L matrix is really a record of the operations required to make 
the coefficient matrix A into the upper-triangular matrix U. We apply these same trans- 
formations to the RHS vector b, converting it to a new vector b’. If we augment b’ to 
U and back-substitute, the solution appears. 
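
The decomposition algorithm given above can be sketched in Python as follows. This is an illustration under our own assumptions: the name `crout` is hypothetical, indices are 0-based, no pivoting is done, and exact rational arithmetic (`fractions.Fraction`) is used in the demonstration so that the factors can be compared with the hand computation for the matrix of the example.

```python
from fractions import Fraction


def crout(A):
    """Crout reduction: return L (lower-triangular) and U (unit upper-triangular)."""
    n = len(A)
    L = [[0] * n for _ in range(n)]
    U = [[0] * n for _ in range(n)]
    for j in range(n):
        # column j of L: l_ij = a_ij - sum_{k<j} l_ik * u_kj, for i >= j
        for i in range(j, n):
            L[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(j))
        if L[j][j] == 0:
            raise ZeroDivisionError("zero diagonal element; pivoting would be needed")
        # row j of U: ones on the diagonal, then
        # u_ji = (a_ji - sum_{k<j} l_jk * u_ki) / l_jj, for i > j
        U[j][j] = 1
        for i in range(j + 1, n):
            U[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(j))) / L[j][j]
    return L, U


A = [[Fraction(v) for v in row] for row in [[3, -1, 2], [1, 2, 3], [2, -2, -1]]]
L, U = crout(A)
print(L)   # rows (3, 0, 0), (1, 7/3, 0), (2, -4/3, -1)
print(U)   # rows (1, -1/3, 2/3), (0, 1, 1), (0, 0, 1)
```

With floating-point input the same routine works unchanged; the inner sums are then the places where extra precision would be accumulated.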

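The general equation for the reduction of b (it is exactly the same as the rule for
forming the elements of U) is

$$b'_1 = \frac{b_1}{l_{11}}, \qquad b'_i = \frac{b_i - \sum_{k=1}^{i-1} l_{ik}\, b'_k}{l_{ii}}, \quad i = 2, 3, \ldots, n.$$

The equations for the back-substitution are

$$x_n = b'_n, \qquad x_j = b'_j - \sum_{k=j+1}^{n} u_{jk}\, x_k, \quad j = n-1, n-2, \ldots, 1.$$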


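For example, if

$$A = \begin{bmatrix} 3 & -1 & 2 \\ 1 & 2 & 3 \\ 2 & -2 & -1 \end{bmatrix},$$

we get

$$L = \begin{bmatrix} 3 & 0 & 0 \\ 1 & \tfrac{7}{3} & 0 \\ 2 & -\tfrac{4}{3} & -1 \end{bmatrix} \quad \text{and} \quad U = \begin{bmatrix} 1 & -\tfrac{1}{3} & \tfrac{2}{3} \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}.$$

For $b = [12, 11, 2]^T$, we find $b' = [4, 3, 2]^T$ because

$$b'_1 = \frac{12}{3} = 4, \qquad b'_2 = \frac{11 - (1)(4)}{\tfrac{7}{3}} = 3, \qquad b'_3 = \frac{2 - (2)(4) - \left(-\tfrac{4}{3}\right)(3)}{-1} = 2.$$

Augmenting b' to U and back-substituting,

$$\left[\begin{array}{rrr|r} 1 & -\tfrac{1}{3} & \tfrac{2}{3} & 4 \\ 0 & 1 & 1 & 3 \\ 0 & 0 & 1 & 2 \end{array}\right],$$

we get

$$x_3 = 2, \qquad x_2 = 3 - 1(2) = 1, \qquad x_1 = 4 - \tfrac{2}{3}(2) - \left(-\tfrac{1}{3}\right)(1) = 3.$$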

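Examination of the above operations reveals that we can get b' by augmenting L
with b and solving that triangular system (a kind of "forward" substitution):

$$\left[\begin{array}{rrr|r} 3 & 0 & 0 & 12 \\ 1 & \tfrac{7}{3} & 0 & 11 \\ 2 & -\tfrac{4}{3} & -1 & 2 \end{array}\right]$$

and

$$b'_1 = \frac{12}{3} = 4, \qquad b'_2 = \frac{11 - (1)(4)}{\tfrac{7}{3}} = 3, \qquad b'_3 = \frac{2 - (2)(4) - \left(-\tfrac{4}{3}\right)(3)}{-1} = 2.$$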


We do not write the algorithm for reducing the b vector and back-substitution. This 
is left as an exercise for the student. 
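Readers who have worked the exercise may wish to compare their answer with the following Python sketch. It is only one possible arrangement: the function name `solve_with_lu` is ours, L and U are taken as separate matrices with ones on the diagonal of U, and exact fractions are used so that the output can be checked against the hand computation above.

```python
from fractions import Fraction


def solve_with_lu(L, U, b):
    """Return (b_prime, x) for A x = b, given A = L U with a unit-diagonal U."""
    n = len(b)
    # Forward substitution: solve L b' = b.
    bp = [Fraction(0)] * n
    for i in range(n):
        s = sum(L[i][k] * bp[k] for k in range(i))
        bp[i] = (Fraction(b[i]) - s) / L[i][i]
    # Back-substitution: solve U x = b'. No division is needed because
    # the diagonal elements of U are all 1.
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][k] * x[k] for k in range(i + 1, n))
        x[i] = bp[i] - s
    return bp, x


F = Fraction
L = [[F(3), F(0), F(0)], [F(1), F(7, 3), F(0)], [F(2), F(-4, 3), F(-1)]]
U = [[F(1), F(-1, 3), F(2, 3)], [F(0), F(1), F(1)], [F(0), F(0), F(1)]]
b_prime, x = solve_with_lu(L, U, [12, 11, 2])
print([str(v) for v in b_prime], [str(v) for v in x])
```

Once L and U are in hand, each further right-hand side costs only one call of this kind.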

A special advantage of these LU methods is that we can accumulate the sums in 
double precision. This gives us greater accuracy by just using one or two double-precision 
variables. We have indicated this in the algorithm. This is not easily done with the 
Gaussian elimination method of Section 2.4. Moreover, the LU method can be easily 
adapted to solve a system of new right-hand-side vectors with great economy of effort.* 
The number of arithmetic operations to get the solution corresponding to each b turns
out to be exactly the same as to multiply an n × n matrix by an n-component vector.

Pivoting with the LU method is somewhat more complicated than with Gaussian 
elimination because we do not usually handle the right-hand-side vector simultaneously 
with our reduction of the A matrix. This means we must keep a record of any row 
interchanges made during the formation of L and U so that the elements of the right- 
hand-side vector can be similarly interchanged. We do the interchanges immediately after 
computing each column of L, choosing the value to appear on the diagonal so as to have 
the one of largest magnitude. We illustrate with an example: 


$$\text{Given } A = \begin{bmatrix} 0 & 2 & 1 \\ 1 & 0 & 0 \\ 3 & 0 & 1 \end{bmatrix}.$$

We will keep a record of row order in a vector: O = [1, 2, 3], representing the original
ordering.

The first column of L is $[0, 1, 3]^T$; we need to interchange rows 3 and 1. To keep
track of this, we interchange the first and third elements of O, so O becomes [3, 2, 1].

Interchange the rows of A and compute the first row of the U matrix. (We use the
compact scheme.)

$$\begin{bmatrix} 3 & 0 & \tfrac{1}{3} \\ 1 & 0 & 0 \\ 0 & 2 & 1 \end{bmatrix}, \quad \text{with } O = [3, 2, 1].$$


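*A numerical example that illustrates using the LU method with multiple right-hand sides will be found in
Section 2.7.

Now compute the second column of L; in the compact array the second column is now $[0, 0, 2]^T$.
We must interchange again, the second row with the third, and O becomes [3, 1, 2].
Making the interchange of rows and computing the second row of U gives

$$\begin{bmatrix} 3 & 0 & \tfrac{1}{3} \\ 0 & 2 & \tfrac{1}{2} \\ 1 & 0 & 0 \end{bmatrix}, \quad \text{with } O = [3, 1, 2].$$

Completing the reduction, we find $l_{33} = 0 - (1)\left(\tfrac{1}{3}\right) = -\tfrac{1}{3}$, giving

$$LU = \begin{bmatrix} 3 & 0 & \tfrac{1}{3} \\ 0 & 2 & \tfrac{1}{2} \\ 1 & 0 & -\tfrac{1}{3} \end{bmatrix}, \quad \text{with } O = [3, 1, 2].$$

To solve the problem Ax = b, with $b^T = [5, -1, -2]$, we rearrange the elements
of b in the order given by O and compute b':

$$[L \mid b] = \left[\begin{array}{rrr|r} 3 & 0 & 0 & -2 \\ 0 & 2 & 0 & 5 \\ 1 & 0 & -\tfrac{1}{3} & -1 \end{array}\right], \qquad b' = \begin{bmatrix} -\tfrac{2}{3} \\ \tfrac{5}{2} \\ 1 \end{bmatrix},$$

so

$$[U \mid b'] = \left[\begin{array}{rrr|r} 1 & 0 & \tfrac{1}{3} & -\tfrac{2}{3} \\ 0 & 1 & \tfrac{1}{2} & \tfrac{5}{2} \\ 0 & 0 & 1 & 1 \end{array}\right], \quad \text{giving } x = \begin{bmatrix} -1 \\ 2 \\ 1 \end{bmatrix}.$$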
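The bookkeeping of this example can be sketched in Python as follows. This is an illustration under our own assumptions rather than a definitive program: the names `crout_lu_pivot` and `lu_solve` are hypothetical, the order vector is kept 0-based (so the [3, 1, 2] of the text appears as [2, 0, 1]), and exact fractions are used so that the result can be compared with the hand computation.

```python
from fractions import Fraction


def crout_lu_pivot(a):
    """Compact Crout reduction with row interchanges.

    Returns (lu, order): lu holds L on and below the diagonal and U above
    it (the unit diagonal of U is implied); order[i] is the original
    0-based index of the row that ended up in position i.
    """
    n = len(a)
    lu = [[Fraction(v) for v in row] for row in a]
    order = list(range(n))
    for j in range(n):
        # Column j of L, from the diagonal down.
        for i in range(j, n):
            lu[i][j] -= sum(lu[i][k] * lu[k][j] for k in range(j))
        # Put the element of largest magnitude on the diagonal.
        p = max(range(j, n), key=lambda i: abs(lu[i][j]))
        if lu[p][j] == 0:
            raise ValueError("matrix is singular")
        if p != j:
            lu[j], lu[p] = lu[p], lu[j]
            order[j], order[p] = order[p], order[j]
        # Row j of U, to the right of the diagonal.
        for i in range(j + 1, n):
            s = sum(lu[j][k] * lu[k][i] for k in range(j))
            lu[j][i] = (lu[j][i] - s) / lu[j][j]
    return lu, order


def lu_solve(lu, order, b):
    """Solve A x = b using the compact array and the recorded row order."""
    n = len(b)
    x = [Fraction(b[i]) for i in order]   # interchange b like the rows of A
    for i in range(n):                    # forward substitution, L b' = b
        s = sum(lu[i][k] * x[k] for k in range(i))
        x[i] = (x[i] - s) / lu[i][i]
    for i in range(n - 2, -1, -1):        # back-substitution, U x = b'
        x[i] -= sum(lu[i][k] * x[k] for k in range(i + 1, n))
    return x


A = [[0, 2, 1], [1, 0, 0], [3, 0, 1]]
lu, order = crout_lu_pivot(A)
x = lu_solve(lu, order, [5, -1, -2])
print(order, [str(v) for v in x])
```

Whole rows of the compact array are interchanged, so the elements of L already computed travel with their rows, just as in the hand computation.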


PATHOLOGY IN LINEAR SYSTEMS—SINGULAR MATRICES 


When a real physical situation is modeled by a set of linear equations, one can anticipate 
that the set of equations will have a solution that matches the values of the quantities in 
the physical problem, at least so far as the equations truly do represent it.* Because of 
round-off errors, the solution vector that is calculated may imperfectly predict the physical 
quantity, but there is assurance that a solution exists, at least in principle. Consequently, 
it must always be theoretically possible to avoid divisions by zero when the set of equations 
has a solution. 


*There are certain problems for which values of interest are determined from a set of equations that do not 
have a unique solution; these are called eigenvalue problems and are discussed in another chapter.

An arbitrary set of equations may not have such a guaranteed solution, however.
There are several such possible situations, which we term “pathological.” In each case, 
there is no unique solution to the set of equations. 

First, if the number of equations relating the variables is less than the number of 
unknowns, we certainly cannot solve for unique values of the unknown variables. It turns 
out, in this case, that there is an infinite set of solutions, for we may arrange the n 
equations with all but n of the variables on the right-hand sides, grouped with the constant 
terms. We may assign almost any desired values to these segregated variables (combining 
them with the constant terms) and then solve for the n remaining variables.* Assigning 
new values to the variables on the right-hand sides gives another set of values for the 
unknowns, and so on. For example, 


$$x_1 + x_2 + x_3 = 7,$$
$$x_1 - x_2 + x_3 = 5.$$

Rewrite as

$$x_1 + x_2 = 7 - x_3,$$
$$x_1 - x_2 = 5 - x_3.$$

If $x_3 = 0$, then $x_1 = 6$, $x_2 = 1$.
If $x_3 = 1$, then $x_1 = 5$, $x_2 = 1$.
If $x_3 = -1$, then $x_1 = 7$, $x_2 = 1$, and so on.


A second situation where we might not expect a set of equations to have a solution 
is that in which the number of equations is greater than the number of unknowns. If there 
are n unknowns, we can normally find a subset of the equations that can be solved for 
the unknowns. There are two subcases to consider: 

If the remaining equations are satisfied by the values of the unknowns we have just 
determined (we would say these equations are consistent with the others), there exists a 
unique solution to the set of equations. Really, of course, there are not truly more equations 
than unknowns in this case; the extra equations are redundant. 

The other subcase is a pathological one. If the solution to the first n equations does 
not satisfy the remaining ones, the set is clearly inconsistent, and no solution exists that 
satisfies the system. 


*An important instance of this situation is the solution of a linear programming problem by the simplex method. 
In this method, the segregated variables are all assigned the value of zero.

Realizing that an equation may be redundant in the above situation makes us
reexamine our more standard case of n equations in n unknowns. What if there is
redundancy there? How can we recognize redundancy when it is present? In this example it is
obvious:


x+y =3, 2x + 2y = 6. 


The second equation is clearly redundant and contains no information not already 
given by the first. This system will then have an infinity of values for x and y; it is an 
example of fewer equations than unknowns. 

Inconsistency may also be present:


x + y = 3, 2x + 2y = 7.


In this case there is no solution.

If n × n systems do not have a unique solution, they have a (square) coefficient
matrix that is called singular. If the coefficient matrix can be triangularized without having 
zeros on the diagonal (hence the set of equations has a solution), the matrix is said to be 
nonsingular. 

Larger systems may have redundancy or inconsistency even though it is not obvious
at a glance. Even in a 3 × 3 system, it is not easy to tell:

$$x_1 - 2x_2 + 3x_3 = 5,$$
$$2x_1 + 4x_2 - x_3 = 7,$$
$$-x_1 - 14x_2 + 11x_3 = 2.$$


Are these inconsistent or redundant? (In other words, is the coefficient matrix sin- 
gular?) Or do they have a unique solution? (That is, is the matrix nonsingular?) Is there 
a rule that we can apply, especially one that works for large systems? The answer is yes, 
there is a rule, or rather, there are several tests we can apply. The standard response from 
mathematics is to determine the rank of the coefficient matrix. If this value is less than 
n, the number of equations, then no unique solution exists; the equations are either 
inconsistent or one or more is redundant, depending on the right-hand-side values. 

But how does one determine the rank? One practical method is to triangularize by 
Gaussian elimination: If no zeros show up on the diagonal of the final triangularized 
coefficient matrix, the rank is equal to n (the matrix is said to be of full rank) and a 
unique solution exists. If, in spite of pivoting, one or more zeros occur on the final
diagonal, there is no unique solution. The set will be consistent (and have redundancy)
if back-substitution gives (0/0) indeterminate forms. When one would need to divide a
nonzero term by zero in back-substituting, inconsistency occurs. Let us apply this to our
earlier 3 × 3 example:

$$\left[\begin{array}{rrr|r} 1 & -2 & 3 & 5 \\ 2 & 4 & -1 & 7 \\ -1 & -14 & 11 & 2 \end{array}\right].$$

On reduction we get

$$\left[\begin{array}{rrr|r} 1 & -2 & 3 & 5 \\ 0 & 8 & -7 & -3 \\ 0 & 0 & 0 & 1 \end{array}\right].$$

We see that there is no solution and the equations are inconsistent. If the constant
term in the third equation were 1 rather than 2, reduction gives

$$\left[\begin{array}{rrr|r} 1 & -2 & 3 & 5 \\ 0 & 8 & -7 & -3 \\ 0 & 0 & 0 & 0 \end{array}\right].$$

This system is redundant.
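The test just described is easy to mechanize. The following Python sketch is one way to do it, under assumptions of our own: the function name `classify` is hypothetical, and exact fractions are used so that "zero on the diagonal" is an exact test (in floating-point arithmetic one would have to compare against a small tolerance instead).

```python
from fractions import Fraction


def classify(a, b):
    """Reduce [A | b] by elimination with row interchanges and report
    'unique', 'redundant', or 'inconsistent' for an n x n system."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)]
         for row, rhs in zip(a, b)]
    rank = 0
    for col in range(n):
        # Look for a nonzero pivot in this column at or below row `rank`.
        pivot = next((r for r in range(rank, n) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n):
            f = m[r][col] / m[rank][col]
            m[r] = [x - f * y for x, y in zip(m[r], m[rank])]
        rank += 1
    if rank == n:
        return "unique"
    # Rows rank..n-1 now have all-zero coefficients; a nonzero right-hand
    # side in any of them means a nonzero term would be divided by zero.
    if any(m[r][n] != 0 for r in range(rank, n)):
        return "inconsistent"
    return "redundant"


A = [[1, -2, 3], [2, 4, -1], [-1, -14, 11]]
print(classify(A, [5, 7, 2]))
print(classify(A, [5, 7, 1]))
```

The variable `rank` counts the nonzero pivots found, which is the rank of the coefficient matrix.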


Another way to find whether a set of equations has a unique solution is to test the 
rows or columns of the coefficient matrix for linear dependency. Vectors are called linearly 
dependent if a linear combination of them can be found that equals the zero vector (one 
with all components equal to zero). (Of course the linear combination ax + by + cz
always equals zero if all the coefficients are zero; we rule out this possibility in our test
for linear dependency. When a = b = c = 0, we say we have the trivial case.)

If the vectors are linearly independent, the only way a weighted sum of them can 
equal the zero vector is to weight each of them with a zero coefficient. 

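Our singular 3 × 3 system has columns that form vectors that are linearly dependent:

$$(-10)\begin{bmatrix} 1 \\ 2 \\ -1 \end{bmatrix} + (7)\begin{bmatrix} -2 \\ 4 \\ -14 \end{bmatrix} + (8)\begin{bmatrix} 3 \\ -1 \\ 11 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}.$$

Similarly, the rows form linearly dependent vectors:

$$(-3)[1 \;\; -2 \;\; 3] + (2)[2 \;\; 4 \;\; -1] + (1)[-1 \;\; -14 \;\; 11] = [0 \;\; 0 \;\; 0].$$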
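Both combinations are quickly verified. The short Python sketch below (the helper name `combine` is our own) forms the weighted sums of the columns and of the rows of the singular matrix with the weights given above.

```python
A = [[1, -2, 3], [2, 4, -1], [-1, -14, 11]]


def combine(weights, vectors):
    """Return the weighted sum of a list of equal-length vectors."""
    return [sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(len(vectors[0]))]


columns = [list(col) for col in zip(*A)]     # the columns of A as vectors
col_combo = combine([-10, 7, 8], columns)
row_combo = combine([-3, 2, 1], A)
print(col_combo, row_combo)
```

Each result is the zero vector although the weights are not all zero, which is exactly the definition of linear dependence.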


In the general case, we say that vectors $x_1, x_2, x_3, \ldots, x_n$ are linearly dependent
if we can find scalar coefficients $a_1, a_2, \ldots, a_n$ (with not all the $a_i$ simultaneously
zero) for which

$$\sum_{i=1}^{n} a_i x_i = 0. \tag{2.8}$$

If the only linear combination of the $x_i$ that equals the zero vector requires that all the $a_i$
be zero, the set of vectors is called linearly independent. It follows that, if a set of vectors
is linearly dependent, at least one of the vectors can be written as a linear combination 
of the others. If the set is linearly independent, none of the vectors can be written as a 
linear combination of the others. As a practical matter, we do not usually test the columns 
(or rows) for linear dependency to determine whether a matrix is singular.

If we are interested in determining the coefficients $a_1, a_2, \ldots, a_n$ that appear in
the linear combination of Eq. (2.8), it turns out that we have to solve a set of linear
equations to obtain them.

It is worthwhile to summarize the concepts and terminology of this section. The
following lists of terms are all equivalent expressions. If a square matrix can be shown
to have one property, it has all the others.


Equivalent Properties of Singular or Nonsingular Matrices

Singular                                   Nonsingular
-----------------------------------------  -----------------------------------------
The matrix is singular.                    The matrix is nonsingular.

A set of equations with these              A set of equations with these
coefficients has no unique solution.       coefficients has a unique solution.

Gaussian elimination cannot avoid a        Gaussian elimination proceeds without
zero on the diagonal.                      a zero on the diagonal.

The rank of the matrix is less than n.     The rank of the matrix equals n.

The rows form linearly dependent           The rows form linearly independent
vectors.                                   vectors.

The columns form linearly dependent        The columns form linearly independent
vectors.                                   vectors.


In the next section we will consider two other properties of the matrix: its determinant 
and its inverse. This adds two more attributes to our lists: A singular matrix has a zero 
determinant and a nonsingular matrix has a nonzero determinant. A singular matrix has 
no inverse and a nonsingular matrix does have an inverse. 


DETERMINANTS AND MATRIX INVERSION 


You have perhaps wondered why there has been no reference so far in this chapter to the 
solution of linear equations by determinants (Cramer’s rule). The reason for this is that, 
except for systems of only two or three equations, the determinant method is too inefficient. 
For example, for a set of 10 simultaneous equations, about 70,000,000 multiplications 
and divisions are required if the usual method of expansion in terms of minors is used. 
A more efficient method of evaluating the determinants can reduce this to about 3000 
multiplications, but even this is inefficient compared to Gaussian elimination, which would 
require about 380. 

In fact, the evaluation of a determinant can perhaps best be done by adapting the 
Gaussian elimination procedure. Its utility derives from the fact that the determinant of 
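a triangular matrix (either upper- or lower-triangular) is just the product of its diagonal
elements. This is easily seen, in the case of an upper-triangular matrix, by expansion in
terms of minors of the first column at each step. For example,

$$\det \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ 0 & a_{22} & a_{23} & a_{24} \\ 0 & 0 & a_{33} & a_{34} \\ 0 & 0 & 0 & a_{44} \end{bmatrix} = a_{11} \det \begin{bmatrix} a_{22} & a_{23} & a_{24} \\ 0 & a_{33} & a_{34} \\ 0 & 0 & a_{44} \end{bmatrix} - 0 + 0 - 0$$

$$= a_{11}\left( a_{22} \det \begin{bmatrix} a_{33} & a_{34} \\ 0 & a_{44} \end{bmatrix} - 0 + 0 \right)$$

$$= a_{11} a_{22} (a_{33} a_{44} - 0) = a_{11} a_{22} a_{33} a_{44}.$$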


Adding a multiple of one row to another row of a matrix does not change the value 
of its determinant. The other row transformations change the value in predictable ways: 
interchanging two rows changes its sign and multiplying a row by a constant multiplies 
the value of the determinant by the same constant. If these changes are allowed for, using 
the procedure of Gaussian elimination to convert to upper-triangular is a simple way to 
evaluate the determinant. 


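EXAMPLE Find the value of the determinant by using elementary row transformations to make it
upper-triangular.

$$\begin{vmatrix} 1 & 4 & -2 & 3 \\ 2 & 2 & 0 & 4 \\ 3 & 0 & -1 & 2 \\ 1 & 2 & 2 & -3 \end{vmatrix} = \begin{vmatrix} 1 & 4 & -2 & 3 \\ 0 & -6 & 4 & -2 \\ 0 & -12 & 5 & -7 \\ 0 & -2 & 4 & -6 \end{vmatrix} = \begin{vmatrix} 1 & 4 & -2 & 3 \\ 0 & -6 & 4 & -2 \\ 0 & 0 & -3 & -3 \\ 0 & 0 & \tfrac{8}{3} & -\tfrac{16}{3} \end{vmatrix}$$

$$= \begin{vmatrix} 1 & 4 & -2 & 3 \\ 0 & -6 & 4 & -2 \\ 0 & 0 & -3 & -3 \\ 0 & 0 & 0 & -8 \end{vmatrix} = (1)(-6)(-3)(-8) = -144.$$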


For programming, an easy and very efficient method for computing the determinant 
is to use steps 2 through 4 of the algorithm in Section 2.4. Then the determinant of the 
matrix is just the product of the diagonal elements, with a reversed sign if there were an
odd number of row interchanges: $\pm a_{11} a_{22} \cdots a_{nn}$, where + is used if there were
0 or an even number of row interchanges in steps 2 through 4 (otherwise we use −).

Applying this algorithm to the example, but using row interchanges, we see

$$\begin{bmatrix} 1 & 4 & -2 & 3 \\ 2 & 2 & 0 & 4 \\ 3 & 0 & -1 & 2 \\ 1 & 2 & 2 & -3 \end{bmatrix} \rightarrow \begin{bmatrix} 3 & 0 & -1 & 2 \\ (0.667) & 4 & -1.667 & 2.333 \\ (0.333) & (0.5) & 3.167 & -4.833 \\ (0.333) & (0.5) & (0.474) & 3.789 \end{bmatrix}.$$

Since 3 × 4 × 3.167 × 3.789 = 144 and there were 3 row interchanges in the process,
we have the determinant = −144.
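A minimal Python sketch of this procedure follows. It is not the algorithm of Section 2.4 itself: the name `determinant` is our own, the multipliers are not stored, and exact fractions replace the rounded decimals shown above so that the final product comes out exactly.

```python
from fractions import Fraction


def determinant(a):
    """Return (det, interchanges) using elimination with partial pivoting."""
    n = len(a)
    m = [[Fraction(v) for v in row] for row in a]
    swaps = 0
    for k in range(n):
        # Choose the pivot of largest magnitude in column k.
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[p][k] == 0:
            return Fraction(0), swaps      # singular matrix
        if p != k:
            m[k], m[p] = m[p], m[k]
            swaps += 1                     # each interchange flips the sign
        for r in range(k + 1, n):
            f = m[r][k] / m[k][k]
            m[r] = [x - f * y for x, y in zip(m[r], m[k])]
    det = Fraction(-1 if swaps % 2 else 1)
    for k in range(n):
        det *= m[k][k]
    return det, swaps


A = [[1, 4, -2, 3], [2, 2, 0, 4], [3, 0, -1, 2], [1, 2, 2, -3]]
det, swaps = determinant(A)
print(det, swaps)
```

The count of interchanges is returned along with the value so that the sign rule can be checked against the hand computation.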


While division of matrices is not defined, the matrix inverse gives the equivalent
result. If the product of two square matrices is the identity matrix, the matrices are said
to be inverses. If AB = I, we write $B = A^{-1}$; also $A = B^{-1}$. Inverses commute on
multiplication, which is not true for matrices in general: AB = BA = I. Not all square
matrices have an inverse. Singular matrices do not have an inverse, and these are of
extreme importance in connection with the coefficient matrix of a set of equations, as
discussed earlier.

The inverse of a matrix can be defined in terms of the matrix of the minors of its
determinant, but this is not a useful way to find an inverse. The Gauss-Jordan technique
can be adapted to provide a practical way to invert a matrix. The procedure is to augment
the given matrix with the identity matrix of the same order. One then reduces the original
matrix to the identity matrix by elementary row transformations, performing the same
operations on the augmentation columns. When the identity matrix stands as the left half
of the augmented matrix, the inverse of the original stands as the right half. It should be
apparent that this is equivalent to solving a set of equations with n different right-hand
sides; each of the right-hand sides is a unit vector, in which the position of the element
whose value is unity changes from row 1 to row 2 to row 3 ... to row n.


EXAMPLE Find the inverse of


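$$A = \begin{bmatrix} 1 & -1 & 2 \\ 3 & 0 & 1 \\ 1 & 0 & 2 \end{bmatrix}.$$

$$\left[\begin{array}{rrr|rrr} 1 & -1 & 2 & 1 & 0 & 0 \\ 3 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 2 & 0 & 0 & 1 \end{array}\right] \rightarrow \left[\begin{array}{rrr|rrr} 1 & -1 & 2 & 1 & 0 & 0 \\ 0 & 3 & -5 & -3 & 1 & 0 \\ 0 & 1 & 0 & -1 & 0 & 1 \end{array}\right]$$

$$\xrightarrow{(1)} \left[\begin{array}{rrr|rrr} 1 & -1 & 2 & 1 & 0 & 0 \\ 0 & 1 & 0 & -1 & 0 & 1 \\ 0 & 0 & -5 & 0 & 1 & -3 \end{array}\right] \xrightarrow{(2)} \left[\begin{array}{rrr|rrr} 1 & -1 & 0 & 1 & \tfrac{2}{5} & -\tfrac{6}{5} \\ 0 & 1 & 0 & -1 & 0 & 1 \\ 0 & 0 & 1 & 0 & -\tfrac{1}{5} & \tfrac{3}{5} \end{array}\right]$$

$$\rightarrow \left[\begin{array}{rrr|rrr} 1 & 0 & 0 & 0 & \tfrac{2}{5} & -\tfrac{1}{5} \\ 0 & 1 & 0 & -1 & 0 & 1 \\ 0 & 0 & 1 & 0 & -\tfrac{1}{5} & \tfrac{3}{5} \end{array}\right].$$

(1) Interchange the third and second rows before eliminating from the third row.
(2) Divide the third row by −5 before eliminating from the first row.


We confirm the fact that we have found the inverse by multiplication:

$$\begin{bmatrix} 1 & -1 & 2 \\ 3 & 0 & 1 \\ 1 & 0 & 2 \end{bmatrix} \begin{bmatrix} 0 & \tfrac{2}{5} & -\tfrac{1}{5} \\ -1 & 0 & 1 \\ 0 & -\tfrac{1}{5} & \tfrac{3}{5} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$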
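The Gauss-Jordan procedure of this example can be sketched in Python as follows. The function name `inverse` is our own, the pivot in each column is chosen as the element of largest magnitude (so the row operations need not follow the hand computation step for step), and exact fractions are used so that the result can be compared with the inverse found above.

```python
from fractions import Fraction


def inverse(a):
    """Invert a square matrix by Gauss-Jordan reduction of [A | I]."""
    n = len(a)
    # Augment each row of A with the matching row of the identity matrix.
    m = [[Fraction(v) for v in row]
         + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(a)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[p][k] == 0:
            raise ValueError("matrix is singular; it has no inverse")
        m[k], m[p] = m[p], m[k]
        pivot = m[k][k]
        m[k] = [v / pivot for v in m[k]]           # make the pivot 1
        for r in range(n):
            if r != k:                             # clear the rest of column k
                f = m[r][k]
                m[r] = [x - f * y for x, y in zip(m[r], m[k])]
    return [row[n:] for row in m]                  # right half is the inverse


A = [[1, -1, 2], [3, 0, 1], [1, 0, 2]]
A_inv = inverse(A)
product = [[sum(A[i][k] * A_inv[k][j] for k in range(3)) for j in range(3)]
           for i in range(3)]
print([[str(v) for v in row] for row in A_inv])
```

The product of A and the computed inverse is formed at the end as the same confirmation by multiplication used in the example.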


However, it is more efficient to use the Gaussian elimination algorithm of Section
2.4 by adding additional unit vectors to the augmented matrix.
Doing steps 2 through 4 gives us

$$\left[\begin{array}{rrr|rrr} 1 & -1 & 2 & 1 & 0 & 0 \\ 3 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 2 & 0 & 0 & 1 \end{array}\right] \rightarrow \left[\begin{array}{rrr|rrr} 3 & 0 & 1 & 0 & 1 & 0 \\ (0.333) & -1 & 1.667 & 1 & -0.333 & 0 \\ (0.333) & (0) & 1.667 & 0 & -0.333 & 1 \end{array}\right].$$

Now applying back-substitution on the last three columns, we get

$$\left[\begin{array}{rrr|rrr} 3 & 0 & 1 & 0 & 0.4 & -0.2 \\ (0.333) & -1 & 1.667 & -1 & 0 & 1 \\ (0.333) & (0) & 1.667 & 0 & -0.2 & 0.6 \end{array}\right],$$

where the last three columns store the inverse matrix. This method is actually more
efficient than the Gauss-Jordan method, which takes about $3n^3/2$ versus $4n^3/3$
multiplications and/or divisions to compute the inverse of a nonsingular matrix. Moreover, in
the Gaussian elimination method, we have also found the LU matrix, which we can use
to solve the system for other right-hand sides.

The inverse of the coefficient matrix provides a way of solving the set of equations 
Ax = b because, when we multiply both sides of the relation by $A^{-1}$, we get

$$A^{-1}Ax = A^{-1}b,$$
$$x = A^{-1}b.$$

The second equation follows because the product $A^{-1}A = I$, the identity matrix, and
Ix = x. If we know the inverse of A, we can solve the system for any right-hand side
b simply by multiplying the b vector by $A^{-1}$. This would seem like a good way to solve
systems of equations, and one finds frequent references to it. 

If we care about the efficiency of our method of solving the equations, however, this 
is not the preferred method, because solving the system with the LU decomposition of 
A, and doing the equivalent of two back-substitutions, requires exactly the same effort 
as multiplying b by the matrix. We compare the efficiency of the two schemes, then, by 
comparing the work needed to get the inverse and that to get the LU equivalent. Getting 
the inverse is more work, because it is the equivalent of solving the system with n right- 
hand sides, while getting the LU is the equivalent of doing only the reduction to triangular 
form. 

Even though the inverse is not the most efficient way to solve a set of simultaneous 
equations, the inverse is very important for theoretical reasons and is essential to the 
understanding of many situations in applied mathematics. The use of the inverse concept 
and notation often simplifies the development of some fundamental relationships. We 
illustrate this by again considering the LU decomposition.

Find a pair of matrices such that LU = A.
Then Ax = b can be written as

$$LUx = b.$$

Multiply both sides by $L^{-1}$,

$$(L^{-1}L)Ux = L^{-1}b,$$

so $Ux = L^{-1}b$ because $L^{-1}L = I$.

We see that Ux (which is a vector because the product of a matrix times a vector always
yields a vector) is equal to the vector formed by $L^{-1}b$. Call this vector b'. Then

$$Ux = b', \qquad b' = L^{-1}b, \qquad \text{or} \qquad Lb' = b.$$

We can get the vector b' by solving the system Lb' = b. This is particularly easy to do
because L is triangular; all we need to do is the back-substitution phase (actually a
forward-substitution because L is lower-triangular).

Once we have b’, we can solve for x from the system Ux = b’. This is also easy 
because U is triangular. 

Observe how using the concept and notation of inverses helps to clarify and prove 
the validity of the LU method. 


NORMS 


When we discuss multicomponent entities like matrices and vectors, we frequently need 
a way to express their magnitude—some measure of “bigness” or “smallness.” For ordi- 
nary numbers, the absolute value tells us how large the number is, but for a matrix there 
are many components, each of which may be large or small in magnitude. (We are not 
talking about the size of a matrix, meaning the number of elements it contains.) 

Any good measure of the magnitude of a matrix (the technical term is norm) must 
have four properties that are intuitively essential: 


1. The norm must always have a value greater than or equal to zero, and must be
zero only when the matrix is the zero matrix (one with all elements equal to
zero).

2. The norm must be multiplied by |k| if the matrix is multiplied by the scalar k.

3. The norm of the sum of two matrices must not exceed the sum of the norms.

4. The norm of the product of two matrices must not exceed the product of the
norms.


More formally, we can state these conditions, using $\|A\|$ to represent the norm of
matrix A:

1. $\|A\| \ge 0$, and $\|A\| = 0$ if and only if A = 0.
2. $\|kA\| = |k| \, \|A\|$.
3. $\|A + B\| \le \|A\| + \|B\|$.
4. $\|AB\| \le \|A\| \, \|B\|$.


The third relationship is called the triangle inequality. The fourth is important when 
we deal with the product of matrices. 

For the special kind of matrices that we call vectors, our past experience can help 
us. For vectors in two- or three-space, the length satisfies all four requirements and is a 
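good value to use for the norm of a vector. This norm is called the Euclidean norm, and
is computed by $\sqrt{x_1^2 + x_2^2 + x_3^2}$.

We compute the Euclidean norm of vectors with more than three components by
generalizing:

$$\|x\|_e = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}.$$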


This is not the only way to compute a vector norm, however. The sum of the absolute
values of the $x_i$ can be used as a norm; the maximum value of the magnitudes of the $x_i$
will also serve. These three norms can be interrelated by defining the p-norm as

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$$

From this it is readily seen that

$$\|x\|_1 = \sum_{i=1}^{n} |x_i| = \text{sum of magnitudes};$$

$$\|x\|_2 = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2} = \text{Euclidean norm};$$

$$\|x\|_\infty = \max_{1 \le i \le n} |x_i| = \text{maximum-magnitude norm}.$$

Which of these vector norms is best to use may depend on the problem. In most
cases, satisfactory results are obtained with any of these measures of the "size" of a
vector.


EXAMPLE Compute the 1-, 2-, and $\infty$-norms of the vector x, if x = (1.25, 0.02, −5.15, 0).

$$\|x\|_1 = |1.25| + |0.02| + |-5.15| + |0| = 6.42;$$
$$\|x\|_2 = [(1.25)^2 + (0.02)^2 + (-5.15)^2 + (0)^2]^{1/2} = 5.2996;$$
$$\|x\|_\infty = |-5.15| = 5.15.$$
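These three norms take only a line each to compute. The following Python sketch (function names our own) evaluates them for the vector of the example, together with the general p-norm for comparison.

```python
import math


def norm_1(x):
    """Sum of magnitudes."""
    return sum(abs(v) for v in x)


def norm_2(x):
    """Euclidean norm."""
    return math.sqrt(sum(v * v for v in x))


def norm_inf(x):
    """Maximum-magnitude norm."""
    return max(abs(v) for v in x)


def norm_p(x, p):
    """General p-norm for p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


x = [1.25, 0.02, -5.15, 0]
print(norm_1(x), norm_2(x), norm_inf(x))
print(norm_p(x, 50))
```

For large p the value of `norm_p` is dominated by the largest component, which suggests why the maximum-magnitude norm carries the subscript $\infty$.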


The norms of a matrix are developed by a correspondence to vector norms. Matrix
norms that correspond to the above, for matrix A, can be shown to be

$$\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}| = \text{maximum column sum};$$

$$\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}| = \text{maximum row sum}.$$


The matrix norm $\|A\|_2$ that corresponds to the 2-norm of a vector is not readily
computed. It is related to the eigenvalues of the matrix (which we discuss in a later 
chapter). It sometimes has special utility because no other norm is smaller than this norm. 
It therefore provides the “tightest” measure of the magnitude of a matrix, but is also the 
most difficult to compute. This norm is also called the spectral norm. 

For an m × n matrix, we can paraphrase the Euclidean (also called Frobenius) norm*
and write

$$\|A\|_e = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 \right)^{1/2}.$$


*Be alert to a situation that may be confusing: For a vector, the 2-norm is the same as the Euclidean norm, 
but for a matrix, the 2-norm is not the same as the Euclidean norm.

EXAMPLE Compute the Euclidean norms of A, B, and C, and the $\infty$-norms, given that

$$A = \begin{bmatrix} 5 & 9 \\ -2 & 1 \end{bmatrix}, \qquad B = \begin{bmatrix} 0.1 & 0 \\ 0.2 & 0.1 \end{bmatrix}, \qquad \text{and} \qquad C = \begin{bmatrix} 0.2 & 0.1 \\ 0.1 & 0 \end{bmatrix}.$$

$$\|A\|_e = \sqrt{25 + 81 + 4 + 1} = \sqrt{111} = 10.54; \qquad \|A\|_\infty = 14.$$
$$\|B\|_e = \sqrt{0.01 + 0 + 0.04 + 0.01} = \sqrt{0.06} = 0.2449; \qquad \|B\|_\infty = 0.3.$$
$$\|C\|_e = \sqrt{0.04 + 0.01 + 0.01 + 0} = \sqrt{0.06} = 0.2449; \qquad \|C\|_\infty = 0.3.$$

The results of our examples look quite reasonable; certainly A is "larger" than B or C.
While B ≠ C, both are equally "small." The Euclidean norm is a good measure of the
magnitude of a matrix.


We see then that there are a number of ways that the norm of a matrix can be 
expressed. Which way is preferred? There are certainly differences in their cost; for
example, some will require more extensive arithmetic than others. The spectral norm is 
usually the most “expensive.” Which norm is best? The answer to this question depends 
in part on the use for the norm. In most instances, we want the norm that puts the smallest 
upper bound on the magnitude of the matrix. In this sense, the spectral norm is “best.” 
We observe, in the next example, that not all the norms give the same value for the 
magnitude of a matrix. 


EXAMPLE

$$A = \begin{bmatrix} 5 & -5 & -7 \\ -4 & 2 & -4 \\ -7 & -4 & 5 \end{bmatrix}.$$

$$\|A\|_e = \text{Euclidean norm} = 15;$$
$$\|A\|_\infty = 17;$$
$$\|A\|_1 = 16;$$
$$\|A\|_2 = \text{Spectral norm} = 12.$$

If a matrix is a diagonal matrix, all p-norms have the same value, however.
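The three easily computed matrix norms can be sketched in Python as below. The function names are ours, and the 3 × 3 matrix is an assumed one, chosen because its Euclidean, $\infty$-, and 1-norms are 15, 17, and 16. The spectral norm is left out because it requires the eigenvalue methods of a later chapter.

```python
import math


def norm_1(a):
    """Maximum column sum of magnitudes."""
    return max(sum(abs(row[j]) for row in a) for j in range(len(a[0])))


def norm_inf(a):
    """Maximum row sum of magnitudes."""
    return max(sum(abs(v) for v in row) for row in a)


def norm_e(a):
    """Euclidean (Frobenius) norm: root of the sum of squared elements."""
    return math.sqrt(sum(v * v for row in a for v in row))


# Assumed illustration matrix with norms 15, 17, and 16.
A = [[5, -5, -7], [-4, 2, -4], [-7, -4, 5]]
print(norm_e(A), norm_inf(A), norm_1(A))
```

None of the three requires more than one pass through the elements of the matrix, which is why they are so much cheaper than the spectral norm.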


Why are norms important? For one thing, they let us express the accuracy of the 
solution to a set of equations in quantitative terms by stating the norm of the error vector 
(the true solution minus the approximate solution vector). Norms are also used to study 
quantitatively the convergence of iterative methods of solving linear systems (which we 
will cover in a later section).


2.9 CONDITION NUMBERS AND ERRORS IN SOLUTIONS 


When we solve a set of linear equations, Ax = b, we hope that the calculated vector $\bar{x}$
is a close representation of the true solution vector x. In the previous examples we have
seen how round-off can make the computed solution differ from the exact solution, and
several times the use of pivoting has been recommended as a way to minimize the effect.
The next example will demonstrate how much the accuracy can be improved by pivoting.
We will exaggerate the effect by using only four-digit arithmetic (rounding to four digits
after each operation). The same effects are observed with more precise arithmetic in large
systems.

Given Ax = b with

$$A = \begin{bmatrix} -0.002 & 4.000 & 4.000 \\ -2.000 & 2.906 & -5.387 \\ 3.000 & -4.031 & -3.112 \end{bmatrix} \quad \text{and} \quad b = (7.998, -4.481, -4.143)^T.$$

After Gaussian elimination without pivoting, the triangularized matrix (augmented with
the b vector) is

$$\left[\begin{array}{rrr|r} -0.002 & 4.000 & 4.000 & 7.998 \\ 0 & -3.997 & -4.005 & -8.002 \\ 0 & 0 & -10.00 & 0.000 \end{array}\right],$$

which gives the computed solution

$$\bar{x} = (-1496, 2.000, 0.000)^T.$$

The exact solution is $x = (1.000, 1.000, 1.000)^T$, and we see that round-off errors
(particularly obvious in the last row of the triangularized system) have given a completely
incorrect result. In this example the very small value of $a_{11}$ has been the source of
difficulty.

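If the equations are reordered to give

$$A = \begin{bmatrix} 3.000 & -4.031 & -3.112 \\ -0.002 & 4.000 & 4.000 \\ -2.000 & 2.906 & -5.387 \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} -4.143 \\ 7.998 \\ -4.481 \end{bmatrix},$$

and we repeat the four-digit computations, the triangularized system becomes

$$\left[\begin{array}{rrr|r} 3.000 & -4.031 & -3.112 & -4.143 \\ 0 & 3.997 & 3.998 & 7.995 \\ 0 & 0 & -7.681 & -7.681 \end{array}\right],$$

which gives $\bar{x} = (1.000, 1.000, 1.000)^T$, whose error is nil.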

We conclude from the above that the algorithm that we employ can have a very 
significant effect on the accuracy. 
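The experiment can be repeated with Python's `decimal` module, which rounds the result of every operation to a chosen number of significant digits. The sketch below is our own arrangement (the name `solve_rounded` is hypothetical). Because the intermediate digits depend on the order of the operations and on the rounding rule, the figures it prints need not agree with those quoted above, but the contrast between the two runs should be just as plain.

```python
from decimal import Decimal, localcontext


def solve_rounded(a, b, digits, pivot):
    """Gaussian elimination with every operation rounded to `digits`
    significant digits; `pivot` turns partial pivoting on or off."""
    with localcontext() as ctx:
        ctx.prec = digits
        n = len(b)
        m = [[Decimal(v) for v in row] + [Decimal(r)]
             for row, r in zip(a, b)]
        for k in range(n - 1):
            if pivot:
                p = max(range(k, n), key=lambda r: abs(m[r][k]))
                m[k], m[p] = m[p], m[k]
            for r in range(k + 1, n):
                f = m[r][k] / m[k][k]
                for c in range(k + 1, n + 1):
                    m[r][c] = m[r][c] - f * m[k][c]
        # Back-substitution, still in rounded arithmetic.
        x = [Decimal(0)] * n
        for i in range(n - 1, -1, -1):
            s = m[i][n]
            for c in range(i + 1, n):
                s = s - m[i][c] * x[c]
            x[i] = s / m[i][i]
        return [float(v) for v in x]


A = [["-0.002", "4.000", "4.000"],
     ["-2.000", "2.906", "-5.387"],
     ["3.000", "-4.031", "-3.112"]]
b = ["7.998", "-4.481", "-4.143"]
plain = solve_rounded(A, b, 4, pivot=False)
pivoted = solve_rounded(A, b, 4, pivot=True)
print("without pivoting:", plain)
print("with pivoting:   ", pivoted)
```

Raising `digits` shows how much extra precision the unpivoted computation needs before its answer becomes acceptable.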

Unfortunately, the error due to round-off is sometimes large even with the best 
available algorithm, because the problem itself may be very sensitive to the effects of 
small errors. Consider this example: 

Given Ax = b, with

$$A = \begin{bmatrix} 3.02 & -1.05 & 2.53 \\ 4.33 & 0.56 & -1.78 \\ -0.83 & -0.54 & 1.47 \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} -1.61 \\ 7.23 \\ -3.38 \end{bmatrix}.$$

If we solve the system by Gaussian elimination, using pivoting and carrying three
digits rounded, the triangularized system is

$$\left[\begin{array}{rrr|r} 4.33 & 0.56 & -1.78 & 7.23 \\ 0 & -1.44 & 3.77 & -6.65 \\ 0 & 0 & -0.00362 & 0.00962 \end{array}\right].$$

The computed solution is $\bar{x} = (0.880, -2.35, -2.66)^T$ while the true solution vector
is $x = (1, 2, -1)^T$. The very small number on the diagonal in the third row is a sign of
such inherent sensitivity to round-off. Such a system is called ill-conditioned. One strategy
to use with ill-conditioned systems is greater precision in the arithmetic operations. For
the above example, when six-digit computations are used, we get marked improvement:

$$\bar{x} = (0.9998, 1.9995, -1.002)^T;$$


but note that there is still an appreciable error. A large ill-conditioned system would 
require even more digits of accuracy than six if we wished to compute a solution anywhere 
near the exact answer. 

The occurrence of an unavoidable small value on the diagonal in an ill-conditioned 
system indicates that the determinant of the original coefficient matrix tends to be a small 
number. (We can make this statement because we can evaluate the determinant of a matrix 
by triangularizing it and then multiplying the elements on the diagonal.) Since a zero 
value for the determinant occurs only for a singular matrix, we can say that an ill- 
conditioned matrix is “nearly singular.” This is hardly a quantitative measure of the 
property, however, so we will look for a different way to express the degree of ill- 
conditioning. 

There are some important problems involving linear systems in which the coefficient 
matrix is always nearly singular, and hence ill-conditioned. One example is fitting a 
polynomial of relatively high degree to a set of points by the least-squares method. This 
is covered in another chapter. 

Another way to look at ill-conditioned systems is to examine the effect of small 
changes in the coefficients. In the above system, changing the $a_{11}$ coefficient from 3.02
to 3.00 gives a solution (exact) of

$$x_1 = 1.07968, \quad x_2 = 4.75637, \quad x_3 = 0.05856.$$

This property of ill-conditioned systems, that their solution is extremely sensitive to
small changes in the coefficients, explains why they are also so sensitive to round-off
error. The small inaccuracies in values as the computation proceeds, caused by round- 
off errors, are equivalent to numbers we would encounter in using exact arithmetic on a 
problem with slightly altered coefficients. 

A more vivid conception of ill-conditioning is obtained by examining a system of 
two equations with two unknowns. 


Consider the system

$$\begin{bmatrix} 1.01 & 0.99 \\ 0.99 & 1.01 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \end{bmatrix},$$

which has the obvious solution x = 1, y = 1. The set of equations can be represented
by two straight lines that intersect at the point (1, 1). The lines are very nearly parallel,
as Fig. 2.2(a) shows. Suppose there is some uncertainty in the coefficients: The position
of the lines might be anywhere in the fuzzy region as indicated by Fig. 2.2(b), and a 
corresponding amplified uncertainty in the point of intersection results. On the other hand, 
as indicated by Fig. 2.2(c), lines that are not almost parallel do not show an amplified 
uncertainty in their point of intersection when the position of the lines is not known 
precisely. If two lines are perpendicular, the uncertainty in the point of intersection is no 
greater than the uncertainty in the lines themselves. 
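The amplification is easy to see numerically. In the following Python sketch (our own illustration, not part of the figure) each right-hand side of the nearly parallel pair is shifted by 1 percent, which moves each line slightly without changing its slope; the 2 × 2 system is solved exactly by determinants, which is reasonable for a system this small.

```python
from fractions import Fraction


def solve_2x2(a, b):
    """Solve a 2 x 2 system exactly by determinants (Cramer's rule)."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    y = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return x, y


F = Fraction
A = [[F("1.01"), F("0.99")], [F("0.99"), F("1.01")]]
base = solve_2x2(A, [F(2), F(2)])
shifted = solve_2x2(A, [F("2.02"), F("1.98")])
print(base, shifted)
```

A shift of 0.02 in each right-hand side moves the point of intersection a full unit in each coordinate, an amplification by a factor of fifty.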

One can consider linearly dependent vectors of order 2 as representing parallel lines. 
The analogy of a consistent set of right-hand sides, with a corresponding redundancy in 
the set of equations, is two coincident straight lines. There is an infinite set of solutions 
because the lines “intersect” everywhere. 


Figure 2.2

Two noncoincident parallel lines are analogous to an inconsistent set of equations
with a singular coefficient matrix; there is no solution and no intersection of the lines.
The visual picture of linear dependence in three-dimensional space would involve parallel 
planes. We need to think of parallel hyperplanes in higher-dimensional spaces. 

A simple way to detect ill-conditioning would be to make a deliberate change in 
some of the coefficients and determine the degree of change in the solution. An even 
simpler way is to examine the values on the diagonal of the triangularized system, but 
a computer subroutine may not always exhibit these values. When ill-conditioning is
suspected, one might compare the solution obtained using single precision with that
obtained using double precision. Doing so both reveals the presence of ill-conditioning
and yields a better approximation of the solution.

In some situations, one can combat ill-conditioning by transforming the problem into 
an equivalent set of equations that are not ill-conditioned. The efficiency of this scheme 
is related to the relative amount of computation required for the transformation, compared 
to the cost of doing the calculations in higher precision.* 

An interesting phenomenon of an ill-conditioned system is that we cannot test for 
the accuracy of the computed solution merely by substituting it into the equations to see 
whether the right-hand sides are reproduced. Consider again the ill-conditioned example 
we have previously examined: 


$$A = \begin{bmatrix} 3.02 & -1.05 & 2.53 \\ 4.33 & 0.56 & -1.78 \\ -0.83 & -0.54 & 1.47 \end{bmatrix}, \qquad b = \begin{bmatrix} -1.61 \\ 7.23 \\ -3.38 \end{bmatrix}.$$

If we compute the vector $Ax$, using the exact solution $x = (1, 2, -1)^T$, we of course get

$$Ax = (-1.61, 7.23, -3.38)^T = b.$$

But if we substitute a clearly erroneous vector

$$\bar{x} = (0.880, -2.35, -2.66)^T,$$

we get $A\bar{x} = (-1.6047, 7.2292, -3.3716)^T$, which is very close to $b$.

We define the residual of a solution vector as the difference between $b$ and $A\bar{x}$, where
$\bar{x}$ is the computed solution:

$$r = b - A\bar{x}.$$

Our example shows that the norm of $r$ is not a good measure of the norm of the error
vector ($e = x - \bar{x}$) for an ill-conditioned system.
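The comparison between residual and error can be reproduced in a few lines. This is an illustrative sketch only; the helper names `matvec` and `inf_norm` are assumptions introduced here, and the maximum-magnitude norm is an arbitrary choice.

```python
A = [[3.02, -1.05, 2.53],
     [4.33, 0.56, -1.78],
     [-0.83, -0.54, 1.47]]
b = [-1.61, 7.23, -3.38]

def matvec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]

def inf_norm(v):
    """Largest component in magnitude."""
    return max(abs(c) for c in v)

x_exact = [1.0, 2.0, -1.0]
x_bad = [0.880, -2.35, -2.66]   # the clearly erroneous vector

residual = [bi - axi for bi, axi in zip(b, matvec(A, x_bad))]
error = [xe - xb for xe, xb in zip(x_exact, x_bad)]

res_norm = inf_norm(residual)
err_norm = inf_norm(error)
print("norm of residual:", res_norm)
print("norm of error:   ", err_norm)
```

The residual norm is below 0.01 while the error norm exceeds 4, so a small residual alone says little about accuracy here.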

Because the degree of ill-condition of the coefficient matrix is so important in deter- 
mining the magnitude of round-off effects, it is valuable to have a quantitative measure. 
The condition number is normally defined as the product of two matrix norms: 


$$\text{Condition}(A) = \|A\| \, \|A^{-1}\|.$$


*Double precision is not required throughout the computations. When the system is solved through LU
decomposition, the accumulation of inner products in double precision is sufficient.

Unfortunately this is not an inexpensive quantity to compute, for it requires us to
invert A. Because inverting a matrix amounts to solving a linear system (solving it with
n different right-hand sides, actually), and the computed solution for an ill-conditioned
system may be inexact, we will not compute $A^{-1}$ very accurately. This suggests that the
condition number will not be computed very exactly either. Ordinarily this causes no
great difficulty; if the condition number is large, we know we are in serious trouble.
Observe that condition numbers will always be at least unity, which corresponds to the
condition number of the identity matrix.

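For our previous example, we have

$$A = \begin{bmatrix} 3.02 & -1.05 & 2.53 \\ 4.33 & 0.56 & -1.78 \\ -0.83 & -0.54 & 1.47 \end{bmatrix}, \qquad A^{-1} = \begin{bmatrix} 5.661 & -7.273 & -18.55 \\ 200.5 & -268.3 & -669.9 \\ 76.85 & -102.6 & -255.9 \end{bmatrix}.$$

Using matrix $\infty$-norms, we find that the condition number is

$$\|A\| \, \|A^{-1}\| = (6.67)(1138.7) = 7595.$$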


The elements of $A^{-1}$ will be large relative to the elements of A when A is
ill-conditioned. However, this can also be true when the elements of A are small, even in
the absence of ill-conditioning. Multiplying the two norms has a normalizing effect, so
the condition number is large only for an ill-conditioned system.
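The condition number of the example can be computed directly. The sketch below is an assumption-laden illustration: `inverse` is a hypothetical Gauss-Jordan routine written for this note (not the book's program), and the norm is the maximum row sum.

```python
def inverse(M):
    """Invert a small dense matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def inf_norm(M):
    """Matrix infinity norm: the largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)

def condition_number(M):
    return inf_norm(M) * inf_norm(inverse(M))

A = [[3.02, -1.05, 2.53],
     [4.33, 0.56, -1.78],
     [-0.83, -0.54, 1.47]]

cond = condition_number(A)
cond_identity = condition_number([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
print("condition number of A:", cond)
print("condition number of I:", cond_identity)
```

The identity matrix gives the minimum value of 1, while the example matrix gives a value in the thousands.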

The condition number lets us relate the magnitude of the error in the computed 
solution to the magnitude of the residual. We use norms to express the magnitude of the 
vectors. 

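Let $e = x - \bar{x}$, where $x$ is the exact solution to $Ax = b$ and $\bar{x}$ is an approximate
solution. Let $r = b - A\bar{x}$, the residual. Since $Ax = b$, we have

$$r = b - A\bar{x} = Ax - A\bar{x} = A(x - \bar{x}) = Ae. \qquad (2.10)$$

Hence,

$$e = A^{-1}r.$$

Taking norms and recalling Eq. (2.9), line 4, for a product, we write

$$\|e\| \le \|A^{-1}\| \, \|r\|. \qquad (2.11)$$

From $r = Ae$, we also have $\|r\| \le \|A\| \, \|e\|$, which combines with Eq. (2.11) to give

$$\frac{\|r\|}{\|A\|} \le \|e\| \le \|A^{-1}\| \, \|r\|. \qquad (2.12)$$

Applying the same reasoning to $Ax = b$ and $x = A^{-1}b$, we get

$$\frac{\|b\|}{\|A\|} \le \|x\| \le \|A^{-1}\| \, \|b\|. \qquad (2.13)$$

Taking Eqs. (2.12) and (2.13) together, we reach a most important relationship:

$$\frac{1}{\|A\| \, \|A^{-1}\|} \frac{\|r\|}{\|b\|} \le \frac{\|e\|}{\|x\|} \le \|A\| \, \|A^{-1}\| \frac{\|r\|}{\|b\|},$$

or

$$\frac{1}{(\text{Condition no.})} \frac{\|r\|}{\|b\|} \le \frac{\|e\|}{\|x\|} \le (\text{Condition no.}) \frac{\|r\|}{\|b\|}. \qquad (2.14)$$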

Equation (2.14) shows that the relative error in the computed solution vector $\bar{x}$ can be
as great as the relative residual multiplied by the condition number. Of course it can also
be as small as the relative residual divided by the condition number. Therefore, when
the condition number is large, the residual gives little information about the accuracy of
$\bar{x}$. Conversely, when the condition number is near unity, the relative residual is a good
measure of the relative error of $\bar{x}$.

When we solve a linear system, we are normally doing so to determine values for 
a physical system for which the set of equations is a model. We use the measured values 
of the parameters of the physical system to evaluate the coefficients of the equations, so 
we expect these coefficients to be known only as precisely as the measurements. When 
these are in error, the solution of the equations will reflect these errors. We have already
seen that an ill-conditioned system is extremely sensitive to small changes in the coefficients.
The condition number lets us relate the change in the solution vector to such
errors in the coefficients of the set of equations $Ax = b$.

Assume that the errors in measuring the parameters cause errors in the coefficients
of A so that the actual set of equations being solved is $(A + E)\bar{x} = b$, where $\bar{x}$ represents
the solution of the perturbed system and A represents the true (but unknown) coefficients.
We let $\bar{A} = A + E$ represent the perturbed coefficient matrix. We desire to know how
large $x - \bar{x}$ is.

Using $Ax = b$ and $\bar{A}\bar{x} = b$, we can write

$$\begin{aligned} x = A^{-1}b = A^{-1}(\bar{A}\bar{x}) &= A^{-1}(A + \bar{A} - A)\bar{x} \\ &= [I + A^{-1}(\bar{A} - A)]\bar{x} \\ &= \bar{x} + A^{-1}(\bar{A} - A)\bar{x}. \end{aligned}$$

Since $\bar{A} - A = E$, we have

$$x - \bar{x} = A^{-1}E\bar{x}.$$

Taking norms, we get

$$\|x - \bar{x}\| \le \|A^{-1}\| \, \|E\| \, \|\bar{x}\| = \|A^{-1}\| \, \|A\| \frac{\|E\|}{\|A\|} \|\bar{x}\|,$$

so that

$$\frac{\|x - \bar{x}\|}{\|\bar{x}\|} \le (\text{Condition no.}) \frac{\|E\|}{\|A\|}.$$


This says that the error of the solution relative to the norm of the computed solution
can be as large as the relative error in the coefficients of A multiplied by the condition
number. The net effect is that, if the coefficients of A are known to only four-digit
precision and the condition number is 1000, the computed vector $\bar{x}$ may have only one
digit of accuracy.
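The bound can be checked on the earlier three-equation example by perturbing a single coefficient. This is a sketch under stated assumptions: `solve` is a hypothetical Gaussian-elimination helper written for this note, the perturbation is the change of the first coefficient from 3.02 to 3.00, and infinity norms are used throughout.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting; inputs are not modified."""
    n = len(b)
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def vec_norm(v):
    return max(abs(c) for c in v)

def mat_norm(M):
    return max(sum(abs(c) for c in row) for row in M)

A = [[3.02, -1.05, 2.53], [4.33, 0.56, -1.78], [-0.83, -0.54, 1.47]]
b = [-1.61, 7.23, -3.38]
n = len(A)

A_bar = [row[:] for row in A]
A_bar[0][0] = 3.00                       # perturbed coefficient
E = [[A_bar[i][j] - A[i][j] for j in range(n)] for i in range(n)]

x = solve(A, b)                          # solution with the true coefficients
x_bar = solve(A_bar, b)                  # solution of the perturbed system

# Columns of the inverse come from solving with unit right-hand sides.
cols = [solve(A, [1.0 if i == j else 0.0 for i in range(n)]) for j in range(n)]
A_inv = [[cols[j][i] for j in range(n)] for i in range(n)]
cond = mat_norm(A) * mat_norm(A_inv)

rel_change = vec_norm([a - c for a, c in zip(x, x_bar)]) / vec_norm(x_bar)
bound = cond * mat_norm(E) / mat_norm(A)
print("relative change in solution:", rel_change)
print("bound from condition number:", bound)
```

The observed relative change stays below the bound, as the inequality requires, although the bound is far from tight for this particular perturbation.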

When the solution to the system $Ax = b$ has been computed, and, because of round-off
error, we obtain the approximate solution vector $\bar{x}$, it is possible to apply iterative
improvement to correct $\bar{x}$ so that it more closely agrees with $x$. Define $e = x - \bar{x}$. Define
$r = b - A\bar{x}$. As shown above [Eq. (2.10)],

$$Ae = r. \qquad (2.16)$$

If we could solve this equation for $e$, we could apply this as a correction to $\bar{x}$. Furthermore,
if $\|e\|/\|\bar{x}\|$ is small, it means that $\bar{x}$ should be close to $x$. In fact, if the value of
$\|e\|/\|\bar{x}\|$ is $10^{-p}$, we know that $\bar{x}$ is probably correct to $p$ digits.

The process of iterative improvement is based on solving Eq. (2.16). Of course this
is also subject to the same round-off error as the original solution of the system for $x$,
so we actually get $\bar{e}$, an approximation to the true error vector. Even so, unless the system
is so ill-conditioned that $\bar{e}$ is not a reasonable approximation to $e$, we will get an improved
estimate of $x$ from $\bar{x} + \bar{e}$. One special caution is important to observe: The computation
of the residual vector $r$ must be as precise as possible. One always uses double-precision
arithmetic; otherwise iterative improvement will not be successful. An example will make
this clear.


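Given

$$A = \begin{bmatrix} 4.23 & -1.06 & 2.11 \\ -2.53 & 6.77 & 0.98 \\ 1.85 & -2.11 & -2.32 \end{bmatrix}, \qquad b = \begin{bmatrix} 5.28 \\ 5.22 \\ -2.58 \end{bmatrix},$$

whose true solution is

$$x = \begin{bmatrix} 1.000 \\ 1.000 \\ 1.000 \end{bmatrix}.$$

If three-digit chopped arithmetic is used, the approximate solution vector is $\bar{x} = (0.991,
0.997, 1.000)^T$. Using double precision, we compute $A\bar{x}$ and the residual as

$$A\bar{x} = \begin{bmatrix} 5.24511 \\ 5.22246 \\ -2.59032 \end{bmatrix}, \qquad r = \begin{bmatrix} 0.0348 \\ -0.00246 \\ 0.0103 \end{bmatrix}.$$

We now solve $A\bar{e} = r$, again using three-digit precision, and get

$$\bar{e} = \begin{bmatrix} 0.00822 \\ 0.00300 \\ -0.00000757 \end{bmatrix}.$$

Finally, correcting $\bar{x}$ with $\bar{x} + \bar{e}$ gives almost exactly the correct solution:

$$\bar{x} + \bar{e} = \begin{bmatrix} 0.999 \\ 1.000 \\ 1.000 \end{bmatrix}.$$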


In the general case, the iterations are repeated until the corrections are negligible. 
Since we want to make the solution of Eq. (2.16) as economical as possible, we should 
use an LU method to solve the original system and apply the LU to Eq. (2.16). 
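The improvement loop can be sketched as follows. This is an illustration under explicit assumptions, not the book's program: `low_precision_solve` is a hypothetical stand-in for a three-digit solver (it solves in double precision and then rounds each component to three significant digits), the residual is formed in ordinary double precision, and the starting vector is the approximate solution quoted in the example.

```python
import math

def solve(A, b):
    """Gaussian elimination with partial pivoting in double precision."""
    n = len(b)
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def round_sig(v, digits=3):
    """Round v to the given number of significant digits."""
    if v == 0.0:
        return 0.0
    return round(v, digits - 1 - math.floor(math.log10(abs(v))))

def low_precision_solve(A, b):
    """Assumed stand-in for a solver that delivers only three good digits."""
    return [round_sig(v) for v in solve(A, b)]

def improve(A, b, x_bar, tol=1e-6, max_sweeps=10):
    """Iterative improvement: solve A e = r and add e to the current estimate."""
    x = list(x_bar)
    for _ in range(max_sweeps):
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        e = low_precision_solve(A, r)
        x = [xi + ei for xi, ei in zip(x, e)]
        if max(abs(ei) for ei in e) < tol:   # corrections are negligible
            break
    return x

A = [[4.23, -1.06, 2.11], [-2.53, 6.77, 0.98], [1.85, -2.11, -2.32]]
b = [5.28, 5.22, -2.58]
x_bar = [0.991, 0.997, 1.000]
improved = improve(A, b, x_bar)
print(improved)
```

In practice the factorization of A would be computed once and reused for every correction, which is the economy the text recommends; the sketch refactors each time only to stay short.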


2.10 ITERATIVE METHODS 


As opposed to the direct method of solving a set of linear equations by elimination, we 
now discuss iterative methods. In certain cases, these methods are preferred over the 
direct methods—when the coefficient matrix is sparse (has many zeros) they may be more 
rapid.* They may be more economical in memory requirements of a computer. For hand 
computation they have the distinct advantage that they are self-correcting if an error is 
made; they may sometimes be used to reduce round-off error in the solutions computed 
by direct methods, as discussed above. They can also be applied to sets of nonlinear 
equations. 
We illustrate the method by a simple example: 


$$\begin{aligned} 8x_1 + x_2 - x_3 &= 8, \\ 2x_1 + x_2 + 9x_3 &= 12, \\ x_1 - 7x_2 + 2x_3 &= -4. \end{aligned} \qquad (2.17)$$


*When the occurrence of zeros follows some easy pattern, elimination methods can take advantage of this.
See the program for a tridiagonal system at the end of the chapter.

The solution is $x_1 = 1$, $x_2 = 1$, $x_3 = 1$. We begin our iterative scheme by solving each
equation for one of the variables, choosing, when possible, to solve for the variable with
largest coefficient:

$$\begin{aligned} x_1 &= 1 - 0.125x_2 + 0.125x_3 && \text{(from first equation)}, \\ x_2 &= 0.571 + 0.143x_1 + 0.286x_3 && \text{(from third equation)}, \\ x_3 &= 1.333 - 0.222x_1 - 0.111x_2 && \text{(from second equation)}. \end{aligned} \qquad (2.18)$$


We begin with some initial approximation to the value of the variables. (Each com- 
ponent might be taken equal to zero if no better initial estimates are at hand.) Substituting 
these into the right-hand sides of the set of equations generates new approximations that 
are closer to the true value. The new values are substituted into the right-hand sides to 
generate a second approximation, and the process is repeated until successive values of 
each of the variables are sufficiently alike. For the set of equations given above we get 


Successive Estimates of Solution (Jacobi Method)

        First   Second  Third   Fourth  Fifth   Sixth   Seventh Eighth
x_1     0       1.000   1.095   0.995   0.993   1.002   1.001   1.000
x_2     0       0.571   1.095   1.026   0.990   0.998   1.001   1.000
x_3     0       1.333   1.048   0.969   1.000   1.004   1.001   1.000


Note that this method is exactly the same as the method of iteration for a single equation
that was discussed in Chapter 1, but applied to a set of equations, for we may write Eq.
(2.18) in the form

$$x^{(n+1)} = G(x^{(n)}) = b' - Bx^{(n)}, \qquad (2.19)$$

which is identical in form to $x_{n+1} = g(x_n)$ as used in Chapter 1.

In the present context, of course, $x^{(n)}$ and $x^{(n+1)}$ refer to the nth and (n + 1)st iterates
of a vector rather than a simple variable, and G is a linear transformation rather than a
nonlinear function. For the example above, we restate Eq. (2.17) in matrix form after
interchanging equations 2 and 3:

$$\begin{bmatrix} 8 & 1 & -1 \\ 1 & -7 & 2 \\ 2 & 1 & 9 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 8 \\ -4 \\ 12 \end{bmatrix}, \qquad Ax = b. \qquad (2.20)$$

Now, let $A = L + D + U$, where

$$L = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 2 & 1 & 0 \end{bmatrix}, \qquad D = \begin{bmatrix} 8 & 0 & 0 \\ 0 & -7 & 0 \\ 0 & 0 & 9 \end{bmatrix}, \qquad U = \begin{bmatrix} 0 & 1 & -1 \\ 0 & 0 & 2 \\ 0 & 0 & 0 \end{bmatrix}.$$




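Then Eq. (2.20) above can be rewritten as

$$\begin{aligned} Ax &= (L + D + U)x = b, \quad \text{or} \\ Dx &= -(L + U)x + b, \quad \text{which gives} \\ x &= -D^{-1}(L + U)x + D^{-1}b. \end{aligned}$$

From this we have, identifying x on the left as the new iterate,

$$x^{(n+1)} = -D^{-1}(L + U)x^{(n)} + D^{-1}b. \qquad (2.21)$$

In Eq. (2.19) we see that

$$b' = D^{-1}b = \begin{bmatrix} 1.000 \\ 0.571 \\ 1.333 \end{bmatrix}, \qquad B = D^{-1}(L + U) = \begin{bmatrix} 0 & 0.125 & -0.125 \\ -0.143 & 0 & -0.286 \\ 0.222 & 0.111 & 0 \end{bmatrix}.$$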


The procedure we have just described is known as the Jacobi method, also called 
“the method of simultaneous displacements,” because each of the equations is simulta- 
neously changed by using the most recent set of x-values. 

We can write the algorithm for the Jacobi iterative method as follows: 


Algorithm for Jacobi Iteration

To solve a system of N linear equations, rearrange the rows so that the diagonal
elements have magnitudes as large as possible relative to the magnitudes of other
coefficients in the same row. Define the rearranged system as $Ax = b$. Beginning
with an initial approximation to the solution vector, $x^{(1)}$, compute each
component of $x^{(n+1)}$, for $i = 1, 2, \ldots, N$, by

$$x_i^{(n+1)} = \frac{b_i}{a_{ii}} - \sum_{\substack{j=1 \\ j \ne i}}^{N} \frac{a_{ij}}{a_{ii}} x_j^{(n)}, \qquad n = 1, 2, \ldots.$$

A sufficient condition for convergence is that

$$|a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{N} |a_{ij}|, \qquad i = 1, 2, \ldots, N.$$

When this is true, $x^{(n)}$ will converge to the solution no matter what initial vector
is used.
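The Jacobi algorithm translates almost line for line into code. The following is a minimal sketch, assuming a stopping test on the largest change between successive iterates (the tolerance and iteration cap are arbitrary choices made here, and `jacobi` is a name introduced for this note).

```python
def jacobi(A, b, x0=None, tol=1e-5, max_iter=100):
    """Jacobi iteration: every component is updated from the previous iterate only."""
    n = len(b)
    x = [0.0] * n if x0 is None else list(x0)
    for iteration in range(1, max_iter + 1):
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if max(abs(new - old) for new, old in zip(x_new, x)) < tol:
            return x_new, iteration
        x = x_new
    raise RuntimeError("Jacobi iteration did not converge")

# Rows ordered so that the diagonal is dominant, as the algorithm requires.
A = [[8, 1, -1],
     [1, -7, 2],
     [2, 1, 9]]
b = [8, -4, 12]

solution, iterations = jacobi(A, b)
print(solution, iterations)
```

Because `x_new` is built entirely from the old vector `x`, the component updates are independent of one another and could be carried out in any order.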


Actually, the x-values of the next trial are not all calculated “simultaneously” when
we perform the Jacobi method. In the above, we calculated the second estimate of $x_1$
before we did the $x_2$, and new values of both $x_1$ and $x_2$ were available before we improved
the value of $x_3$. In nearly all cases the new values are better than the old, and should be
used in preference to the poorer values.* When this is done, the method is known by the
name Gauss-Seidel. In this method our first step is to rearrange the set of equations by
solving each equation for one of the variables in terms of the others, exactly as we have
done above in the Jacobi method. One then proceeds to improve each x-value in turn,
using always the most recent approximations to the values of the other variables. The
rate of convergence is more rapid, as shown by reworking the same example as earlier
(Eqs. (2.18), rearranged form of (2.17)):


Successive Estimates of Solution (Gauss-Seidel Method)

        First   Second  Third   Fourth  Fifth   Sixth
x_1     0       1.000   1.041   0.997   1.001   1.000
x_2     0       0.714   1.014   0.996   1.000   1.000
x_3     0       1.032   0.990   1.002   1.000   1.000

These values were computed by using Eq. (2.18) in the following way:

$$\begin{aligned} x_1^{(n+1)} &= 1 - 0.125x_2^{(n)} + 0.125x_3^{(n)}, \\ x_2^{(n+1)} &= 0.571 + 0.143x_1^{(n+1)} + 0.286x_3^{(n)}, \\ x_3^{(n+1)} &= 1.333 - 0.222x_1^{(n+1)} - 0.111x_2^{(n+1)}, \end{aligned}$$

beginning with $x^{(1)} = (0, 0, 0)^T$.


An algorithm for Gauss—Seidel iteration is as follows: 


Algorithm for Gauss-Seidel Iteration

To solve a system of N linear equations, rearrange the rows so that the diagonal
elements have magnitudes as large as possible relative to the magnitudes of other
coefficients in the same row. Define the rearranged system as $Ax = b$. Beginning
with an initial approximation to the solution vector, $x^{(1)}$, compute each
component of $x^{(n+1)}$, for $i = 1, 2, \ldots, N$, by

$$x_i^{(n+1)} = \frac{b_i}{a_{ii}} - \sum_{j=1}^{i-1} \frac{a_{ij}}{a_{ii}} x_j^{(n+1)} - \sum_{j=i+1}^{N} \frac{a_{ij}}{a_{ii}} x_j^{(n)}, \qquad n = 1, 2, \ldots.$$

A sufficient condition for convergence is that

$$|a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{N} |a_{ij}|, \qquad i = 1, 2, \ldots, N.$$

When this is true, $x^{(n)}$ will converge to the solution no matter what initial vector
is used.
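The only change from Jacobi iteration is that each new component is used as soon as it is available. A minimal sketch, with an assumed stopping test on the largest change in one sweep (`gauss_seidel` and its defaults are choices made for this note):

```python
def gauss_seidel(A, b, x0=None, tol=1e-5, max_iter=100):
    """Gauss-Seidel iteration: components are overwritten in place, in order."""
    n = len(b)
    x = [0.0] * n if x0 is None else list(x0)
    for iteration in range(1, max_iter + 1):
        biggest_change = 0.0
        for i in range(n):
            # x[j] already holds the new value for j < i and the old one for j > i.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_value = (b[i] - s) / A[i][i]
            biggest_change = max(biggest_change, abs(new_value - x[i]))
            x[i] = new_value
        if biggest_change < tol:
            return x, iteration
    raise RuntimeError("Gauss-Seidel iteration did not converge")

A = [[8, 1, -1],
     [1, -7, 2],
     [2, 1, 9]]
b = [8, -4, 12]

solution, iterations = gauss_seidel(A, b)
print(solution, iterations)
```

Storing the iterate in a single list that is updated in place is what makes the newest values available automatically; it also halves the storage compared with keeping two full vectors.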


*There may be times when this obvious move is not beneficial, due to cancellation of errors, but in general
this is good strategy in numerical analysis.

The Gauss-Seidel method will generally converge if the Jacobi method converges,
and will do so more rapidly. However, the Jacobi method might still be the preferred
method if we were running our program on parallel processors, since all n equations
could be solved simultaneously at each iteration. In fact, other routines once thought
“obsolete” may have new life due to parallel processing.

The matrix formulation for the Gauss-Seidel method is almost like the one given in
Eq. (2.21). Here $Ax = b$ is rewritten as

$$(L + D)x = -Ux + b. \qquad (2.22)$$

From this we have

$$x^{(n+1)} = -D^{-1}Lx^{(n+1)} - D^{-1}Ux^{(n)} + D^{-1}b. \qquad (2.23)$$

(Compare this with Eq. (2.21).) For our example,

$$D^{-1}L = \begin{bmatrix} 0 & 0 & 0 \\ -0.143 & 0 & 0 \\ 0.222 & 0.111 & 0 \end{bmatrix}, \quad \text{and} \quad D^{-1}U = \begin{bmatrix} 0 & 0.125 & -0.125 \\ 0 & 0 & -0.286 \\ 0 & 0 & 0 \end{bmatrix}.$$

The usefulness of the matrix notation will be more apparent in Chapter 6, when we study
the eigenvalues of the matrices $D^{-1}(L + U)$ in Eq. (2.21) and $(L + D)^{-1}U$ in Eq.
(2.22). The eigenvalues of these matrices explain the rates of convergence of both
methods.

These iteration methods will not converge for all sets of equations, nor for all possible 
rearrangements of the equations. When the equations can be ordered so that each diagonal 
entry is larger in magnitude than the sum of the magnitudes of the other coefficients in 
that row (such a system is called diagonally dominant), the iteration will converge for 
any starting values. This is easy to visualize, because all the equations can be put in the 
form

$$x_i = \frac{b_i}{a_{ii}} - \frac{a_{i1}}{a_{ii}}x_1 - \frac{a_{i2}}{a_{ii}}x_2 - \cdots. \qquad (2.24)$$


The error in the next value of x; will be the sum of the errors in all other x’s multiplied 
by the coefficients of Eq. (2.24), and if the sum of the magnitudes of the coefficients is 
less than unity, the error will decrease as the iteration proceeds. The preceding conver- 
gence condition is a sufficient condition only; that is, if the condition holds, the system 
always converges, but sometimes the system converges even if the condition is violated. 

The speed with which the iterations converge is obviously related to the degree of 
dominance of the diagonal terms, for the coefficients in Eq. (2.24) are then smaller and 
x; is less affected by the errors in the other components. When the initial approximation 
is close to the solution vector, relatively few iterations are needed to get an acceptable 
solution. 


2.11 RELAXATION METHOD


There is an iteration method that is more rapidly convergent than Gauss—Seidel and that 
can be used to advantage for hand calculations. It is unfortunately not well adapted to 
computer application. The method is due to a British engineer, Richard Southwell, and 
has been applied to a wide variety of problems. (Allen (1954) is an excellent reference.)
We discuss the method because of its historical importance and because it leads to an 
important acceleration technique called overrelaxation. 

If we consider the Gauss-Seidel scheme, we realize that the order in which the 
equations are used is important. We should improve that x that is most in error, since, 
in the rearranged form, that variable does not appear on the right, and hence its own 
error will not affect the next iterate. By using that equation, then, we introduce lesser 
errors into the computation of the next iterate. The method of relaxation is a scheme that 
permits one to select the best equation to be used for maximum rate of convergence. 

We illustrate the method by the same example we solved in Section 2.10. The original
equations are

$$\begin{aligned} 8x_1 + x_2 - x_3 &= 8, \\ 2x_1 + x_2 + 9x_3 &= 12, \\ x_1 - 7x_2 + 2x_3 &= -4. \end{aligned} \qquad (2.25)$$

We again begin by a rearrangement of the equations, but different from that for the
Gauss-Seidel or Jacobi methods. We transpose all the terms to one side, and then divide
by the negative of the largest coefficient. Equations (2.25) become

$$\begin{aligned} -x_1 - 0.125x_2 + 0.125x_3 + 1 &= 0, \\ -0.222x_1 - 0.111x_2 - x_3 + 1.333 &= 0, \\ 0.143x_1 - x_2 + 0.286x_3 + 0.571 &= 0. \end{aligned} \qquad (2.26)$$

If we begin with some initial set of values, and substitute in Eqs. (2.26), the equations
will not be satisfied (unless, by chance, we have stumbled onto the solution); the left
sides will not be zero, but some other value that we call the residual and denote by $R_i$.
It is also convenient to reorder the equations so the $-1$ coefficients are on the diagonal.
Equations (2.26) become, with these rearrangements,

$$\begin{aligned} -x_1 - 0.125x_2 + 0.125x_3 + 1 &= R_1, \\ 0.143x_1 - x_2 + 0.286x_3 + 0.571 &= R_2, \\ -0.222x_1 - 0.111x_2 - x_3 + 1.333 &= R_3. \end{aligned}$$

For example, with $x_1 = 0$, $x_2 = 0$, $x_3 = 0$, we have

$$R_1 = 1, \quad R_2 = 0.571, \quad R_3 = 1.333.$$

The largest residual in magnitude, $R_3$, tells us that the third equation is most in error and
should be improved first. The method gets its name “relaxation” from the fact that we
make a change in $x_3$ to relax $R_3$ (the greatest residual) so as to make it zero. Observing
the coefficients of the various equations, we see that increasing the value of $x_3$ by one,
say, will decrease $R_3$ by one, will increase $R_1$ by 0.125, and increase $R_2$ by 0.286. To
change $R_3$ from its initial value of 1.333 to zero, we should increase $x_3$ by that same
amount.

We then select the new residual of greatest magnitude, and relax it to zero. We 
continue until all residuals are zero, and when this is true, the values of the x’s will be 
at the exact solution. In implementing this method, there are some modifications that 
make the work easier. We illustrate in Fig. 2.3. 

We make three double columns, one for each variable and for the residual of the
equation in which that variable appears with $-1$ coefficient. The initial x values and the
initial residuals are entered as the first row of the table. It is convenient to work entirely
with integers by multiplying the initial x-values and residuals by 1000, and then to scale
down the solution by dividing by 1000 at the end of the computations. We avoid fractions;
if a fractional change in a variable is needed to relax to zero we only relax to near zero.

[Figure 2.3 Solving a set of linear equations by relaxation. The figure lists the coefficients of equations I, II, and III ($-1$, $-0.125$, $0.125$; $0.143$, $-1$, $0.286$; $-0.222$, $-0.111$, $-1$) above three double columns headed $x_1$ $R_1$, $x_2$ $R_2$, $x_3$ $R_3$. The first row is 0, 1000; 0, 571; 0, 1333, and the increments sum to the final values 999, 1000, 1001.]

In Fig. 2.3, we set down the increments to the x’s but record the cumulative effect 
on the residuals. (The old values of the residuals are crossed out when replaced by a new 
value.) When the residuals are zero, we add the various increments to the initial value 
to get the final value. In this example, round-off errors cause an error of one in the third 
decimal. 

It is important to make a final check by recomputing residuals at the end of the 
calculation to check for mistakes in arithmetic. The method is not usually programmed 
because searching on the computer for the largest residual is slow, and adds enough 
execution time that the acceleration gives no net benefit. The search can be done rapidly 
by scanning the residuals in a hand calculation, however. 
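For comparison with the hand procedure, the relaxation rule (always relax the residual of largest magnitude to zero) can be sketched in code. This is an illustration only: `southwell` is a name introduced for this note, and the sketch works in floating point rather than the scaled integers used in the hand calculation.

```python
def southwell(C, constants, tol=1e-6, max_steps=1000):
    """Relax the largest residual to zero at each step.

    C holds the coefficients of the rearranged equations (with -1 on the
    diagonal), so residual i is sum_j C[i][j] * x[j] + constants[i].
    """
    n = len(constants)
    x = [0.0] * n
    R = list(constants)              # residuals for the starting vector x = 0
    for step in range(max_steps):
        k = max(range(n), key=lambda i: abs(R[i]))   # search for largest residual
        if abs(R[k]) < tol:
            return x, step
        dx = R[k]                    # coefficient of x_k in equation k is -1
        x[k] += dx
        for i in range(n):           # update every residual, including R[k] -> 0
            R[i] += C[i][k] * dx
    raise RuntimeError("relaxation did not converge")

C = [[-1.0, -0.125, 0.125],
     [0.143, -1.0, 0.286],
     [-0.222, -0.111, -1.0]]
constants = [1.0, 0.571, 1.333]

solution, steps = southwell(C, constants)
print(solution, steps)
```

The `max` call on every step is the search that the text identifies as the costly part on a computer.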

Southwell and his coworkers observed, for many situations, that relaxing the residuals 
to zero was less efficient than relaxing beyond zero (overrelaxing) or relaxing short of 
zero (underrelaxing). The reason this is an improved strategy is that a zero residual 
doesn’t stay zero; relaxing the residual of another equation affects the first residual, so 
it is appropriate to anticipate and allow for this by an appropriate under- or overrelaxation. 

Table 2.1 shows that a significant improvement in the speed of convergence is obtained
if $R_1$ is underrelaxed by 10% and $R_3$ is underrelaxed by 25%. Unfortunately the optimum
degree of under- or overrelaxation is not easily determined. In many problems, acceleration
is obtained by overrelaxing rather than underrelaxing.

Table 2.1 Accelerated solution of linear equations by relaxation

Eq. No.     x_1      x_2      x_3
I          -1       -0.125    0.125
II          0.143   -1        0.286
III        -0.222   -0.111   -1

   x_1     R_1      x_2     R_2      x_3     R_3
     0    1000        0     571        0    1333
          1125              857    +1000     333
 +1013     112             1001              108
           -13    +1001       0               -3
   -12      -1               -2                0
            -1       -2       0                0
    -1       0                0                0
  ----             ----             ----
  1000              999             1000

Even though Southwell’s relaxation method is not often used today, there is one
aspect of it that has influence on the iterative solution of linear equations by computer.
In using the Gauss-Seidel method, we can speed up the convergence by “overrelaxation,”
that is, by making the residuals go to the other side of zero instead of just relaxing to
zero as in the first example. We can apply this to Gauss-Seidel iteration by modifying
the algorithm.

The standard relationship for Gauss-Seidel iteration for the set of equations $Ax = b$,
for variable $x_i$, can be written

$$x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)} \right), \qquad i = 1, 2, \ldots, n, \qquad (2.27)$$

where the superscript (k + 1) indicates that this is the (k + 1)st iterate. On the right side
we use the most recent estimates of the $x_j$, which will be either $x_j^{(k)}$ or $x_j^{(k+1)}$.

An algebraically equivalent form for Eq. (2.27) is

$$x_i^{(k+1)} = x_i^{(k)} + \frac{1}{a_{ii}} \left( b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} - \sum_{j=i}^{n} a_{ij}x_j^{(k)} \right),$$

because $x_i^{(k)}$ is both added to and subtracted from the right side. In this form, we see that
Gauss-Seidel and Southwell’s relaxation can have identical arithmetic: The term we add
to $x_i^{(k)}$ to get $x_i^{(k+1)}$ is exactly the increment that relaxes the residual to zero. (Of course,
we apply the relaxation to the $x_i$’s in a different sequence in the two methods.) Overrelaxation
can be applied to Gauss-Seidel if we will add to $x_i^{(k)}$ some multiple of the second
term. It can be shown that this multiple should never be more than 2 in magnitude (to
avoid divergence), and the optimum overrelaxation factor lies between 1.0 and 2.0. Our
iteration equations take this form, where w is the overrelaxation factor:

$$x_i^{(k+1)} = x_i^{(k)} + \frac{w}{a_{ii}} \left( b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} - \sum_{j=i}^{n} a_{ij}x_j^{(k)} \right). \qquad (2.28)$$
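Equation (2.28) can be sketched in code as follows. This is an illustration under stated assumptions: `sor` is a name introduced here, the stopping test (largest increment in a sweep below a tolerance) is one possible choice and need not match the criterion behind any tabulated iteration counts, and the 4 × 4 diagonally dominant system is used purely as a test case.

```python
def sor(A, b, w, tol=1e-5, max_iter=500):
    """Gauss-Seidel with overrelaxation factor w; w = 1.0 is plain Gauss-Seidel."""
    n = len(b)
    x = [0.0] * n
    for iteration in range(1, max_iter + 1):
        biggest_change = 0.0
        for i in range(n):
            # Residual of equation i using the newest values available in x.
            residual = b[i] - sum(A[i][j] * x[j] for j in range(n))
            delta = w * residual / A[i][i]
            x[i] += delta
            biggest_change = max(biggest_change, abs(delta))
        if biggest_change < tol:
            return x, iteration
    raise RuntimeError("iteration did not converge")

# Assumed test system: diagonally dominant, exact solution (-1, -1, -1, -1).
A = [[-4, 1, 1, 1],
     [1, -4, 1, 1],
     [1, 1, -4, 1],
     [1, 1, 1, -4]]
b = [1, 1, 1, 1]

counts = {}
for w in (1.0, 1.3, 1.6):
    solution, iterations = sor(A, b, w)
    counts[w] = iterations
    print("w =", w, "iterations =", iterations)
```

Because `x` is updated in place, the single sum over all `j` automatically mixes new values (for `j < i`) with old ones (for `j >= i`), exactly as the two sums in Eq. (2.28) prescribe.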


Table 2.2 shows how the convergence rate is influenced by the value of w for the system

$$\begin{bmatrix} -4 & 1 & 1 & 1 \\ 1 & -4 & 1 & 1 \\ 1 & 1 & -4 & 1 \\ 1 & 1 & 1 & -4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix},$$

starting with an initial estimate of x = 0. The exact solution is

$$x_1 = -1, \quad x_2 = -1, \quad x_3 = -1, \quad x_4 = -1.$$

Table 2.2 Acceleration of convergence of Gauss-Seidel iteration

w, the overrelaxation    Number of iterations
factor                   to reach error < 1 × 10^-5
1.0                      24
1.1                      18
1.2                      13
1.3                      11   (minimum number of iterations)
1.4                      14
1.5                      18
1.6                      24
1.7                      35
1.8                      55
1.9                      100+

We see that the optimum value for the overrelaxation factor is about w = 1.3 for this
example. The optimum value will vary between 1.0 and 2.0 depending on the
size of the coefficient matrix and the values of the coefficients. Overrelaxation is considered
further in Chapter 7 in connection with methods to solve partial-differential equations.


2.12 SYSTEMS OF NONLINEAR EQUATIONS


As mentioned previously, the problem of finding the solution of a set of nonlinear equations
is much more difficult than for linear equations. (In fact, some sets have no real solutions.)
Consider the example of a pair of nonlinear equations:

$$\begin{aligned} x^2 + y^2 &= 4, \\ e^x + y &= 1. \end{aligned} \qquad (2.29)$$


Graphically, the solution to this system is represented by the intersections of the
circle $x^2 + y^2 = 4$ with the curve $y = 1 - e^x$. Figure 2.4 shows that these are near
(−1.8, 0.8) and (1, −1.7). We can use the method of iteration to improve these approximations.
Just as in Section 1.6, we rearrange both equations to a form of the pattern
x = f(x, y), y = g(x, y), and use the method of iteration on each equation in turn.
Under proper conditions, these will converge. For example, if we rearrange Eqs. (2.29)
in the form

$$x = \pm\sqrt{4 - y^2} \quad (-\text{ sign for leftmost root}), \qquad y = 1 - e^x,$$

we get the following successive values, beginning with y = 0.8 in the first equation:

x-values:          -1.83    -1.815    -1.8163    -1.8162
y-values:   0.8     0.84     0.8372    0.8374     0.8374

[Figure 2.4 The circle $x^2 + y^2 = 4$ and the curve $y = 1 - e^x$; the left intersection is labeled (−1.8, 0.8).]


When we begin at y = −1.7 to find the root to the right of the origin, we get

x-values:           1.05     0.743     1.669    Imaginary value.
y-values:  -1.7    -1.857   -1.102    -4.307

The equations diverge! (Beginning with x = 1.0 in the second equation is no help.
This also diverges.) However, with a different rearrangement of the original equations,
such as

$$y = \pm\sqrt{4 - x^2} \quad (-\text{ sign for rightmost root}), \qquad x = \ln(1 - y), \qquad (2.30)$$

we get

x-values:           0.993    1.006     1.0038    1.0042    1.0042
y-values:  -1.7    -1.736   -1.7286   -1.7299   -1.7296

The pair of rearranged equations in (2.30) converges.
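The two convergent rearrangements can be tried directly. This is a minimal sketch; the function names, tolerance, and iteration cap are assumptions made here, and each function alternates between the two rearranged equations as described in the text.

```python
import math

def leftmost_root(y=0.8, tol=1e-6, max_iter=100):
    """Alternate x = -sqrt(4 - y^2) and y = 1 - e^x, starting from y."""
    for _ in range(max_iter):
        x = -math.sqrt(4.0 - y * y)
        y_new = 1.0 - math.exp(x)
        if abs(y_new - y) < tol:
            return x, y_new
        y = y_new
    raise RuntimeError("iteration did not converge")

def rightmost_root(y=-1.7, tol=1e-6, max_iter=100):
    """Alternate x = ln(1 - y) and y = -sqrt(4 - x^2), starting from y."""
    for _ in range(max_iter):
        x = math.log(1.0 - y)
        y_new = -math.sqrt(4.0 - x * x)
        if abs(y_new - y) < tol:
            return x, y_new
        y = y_new
    raise RuntimeError("iteration did not converge")

x_left, y_left = leftmost_root()
x_right, y_right = rightmost_root()
print("left root: ", x_left, y_left)
print("right root:", x_right, y_right)
```

If the first rearrangement were used near the right-hand root instead, the argument of the square root would eventually become negative and `math.sqrt` would raise a `ValueError`, which is the "imaginary value" of the hand computation.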
Some of the difficulties with sets of nonlinear equations are apparent from this simple
example. If there are more than two equations in the system, finding a convergent form
of the equations is increasingly difficult. A criterion for convergence (sufficiency condition
only) is as follows:

The set of equations

$$x = f(x, y, z, \ldots), \qquad y = g(x, y, z, \ldots), \qquad z = h(x, y, z, \ldots), \qquad \ldots$$

will converge if, in an interval about the root,

$$\begin{aligned} |f_x| + |f_y| + |f_z| + \cdots &< 1, \\ |g_x| + |g_y| + |g_z| + \cdots &< 1, \\ |h_x| + |h_y| + |h_z| + \cdots &< 1, \\ &\;\;\vdots \end{aligned}$$


In the above inequalities, the subscript notation designates partial derivatives. Com- 
puting all the partial derivatives and knowing where the root is are major problems. 
Getting starting values for the multidimensioned system is also correspondingly difficult. 

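Newton’s method can also be applied to systems. We begin with the forms

$$f(x, y) = 0, \qquad g(x, y) = 0.$$

Let x = r, y = s be a root, and expand both functions as a Taylor series about the
point $(x_i, y_i)$ in terms of $(r - x_i)$, $(s - y_i)$, where $(x_i, y_i)$ is a point near the root:

$$\begin{aligned} f(r, s) = 0 &= f(x_i, y_i) + f_x(x_i, y_i)(r - x_i) + f_y(x_i, y_i)(s - y_i) + \cdots, \\ g(r, s) = 0 &= g(x_i, y_i) + g_x(x_i, y_i)(r - x_i) + g_y(x_i, y_i)(s - y_i) + \cdots. \end{aligned} \qquad (2.31)$$

Truncating the series gives

$$\begin{bmatrix} 0 \\ 0 \end{bmatrix} = \begin{bmatrix} f(x_i, y_i) \\ g(x_i, y_i) \end{bmatrix} + \begin{bmatrix} f_x(x_i, y_i) & f_y(x_i, y_i) \\ g_x(x_i, y_i) & g_y(x_i, y_i) \end{bmatrix} \begin{bmatrix} r - x_i \\ s - y_i \end{bmatrix}. \qquad (2.32)$$

We rewrite this to solve as the system of equations

$$\begin{bmatrix} f_x(x_i, y_i) & f_y(x_i, y_i) \\ g_x(x_i, y_i) & g_y(x_i, y_i) \end{bmatrix} \begin{bmatrix} \Delta x_i \\ \Delta y_i \end{bmatrix} = -\begin{bmatrix} f(x_i, y_i) \\ g(x_i, y_i) \end{bmatrix}, \qquad (2.33)$$

where $\Delta x_i = r - x_i$ and $\Delta y_i = s - y_i$.

We solve (2.33) by Gaussian elimination and then, if we set

$$x_{i+1} = x_i + \Delta x_i, \qquad y_{i+1} = y_i + \Delta y_i,$$

we get an improved estimate of the root, (r, s). We repeat the above process with i
replaced by i + 1 until f and g are close to 0. The extension to more than two simultaneous
equations is straightforward. Program NLSYST is an implementation.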

We illustrate by repeating the previous example:

$$f(x, y) = 4 - x^2 - y^2 = 0, \qquad g(x, y) = 1 - e^x - y = 0.$$

The partials are

$$f_x = -2x, \quad f_y = -2y, \quad g_x = -e^x, \quad g_y = -1.$$

Beginning at $x_0 = 1$, $y_0 = -1.7$, we solve the equations

$$\begin{bmatrix} -2 & 3.4 \\ -2.7183 & -1 \end{bmatrix} \begin{bmatrix} \Delta x_0 \\ \Delta y_0 \end{bmatrix} = -\begin{bmatrix} 0.1100 \\ -0.0183 \end{bmatrix}$$

to get $\Delta x_0 = 0.0043$, $\Delta y_0 = -0.0298$. This then gives us $x_1 = 1.0043$, $y_1 = -1.7298$.
The results already agree with the true value of the root within two in the fourth decimal
place. Repeating the process once more produces $x_2 = 1.004169$, $y_2 = -1.729637$. The
function values at $(x_2, y_2)$ are approximately −0.0000001 and −0.00000001 respectively.
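The two-equation Newton step can be sketched in a few lines. This is an illustration, not the NLSYST program: `newton_2x2` is a name introduced here, the analytic partials are those listed in the example, and the 2 × 2 linear system is solved by Cramer's rule for brevity rather than by Gaussian elimination.

```python
import math

def f(x, y):
    return 4.0 - x * x - y * y

def g(x, y):
    return 1.0 - math.exp(x) - y

def newton_2x2(x, y, tol=1e-10, max_iter=20):
    """Newton's method for f = 0, g = 0 using the analytic partial derivatives."""
    for _ in range(max_iter):
        fx, fy = -2.0 * x, -2.0 * y
        gx, gy = -math.exp(x), -1.0
        det = fx * gy - fy * gx
        rhs_f, rhs_g = -f(x, y), -g(x, y)
        dx = (rhs_f * gy - fy * rhs_g) / det   # Cramer's rule for the corrections
        dy = (fx * rhs_g - gx * rhs_f) / det
        x, y = x + dx, y + dy
        if max(abs(dx), abs(dy)) < tol:
            return x, y
    raise RuntimeError("Newton iteration did not converge")

x_root, y_root = newton_2x2(1.0, -1.7)
print(x_root, y_root)
```

Starting from (1, −1.7), the first pass through the loop produces the corrections quoted in the example, and a handful of passes reaches the limit of double precision.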

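We can write Newton’s method for a system of n equations by expanding on
Eq. (2.31):

$$\begin{aligned} 0 &= f_1 + (f_1)_x(r - x_0) + (f_1)_y(s - y_0) + (f_1)_z(t - z_0) + \cdots, \\ 0 &= f_2 + (f_2)_x(r - x_0) + (f_2)_y(s - y_0) + (f_2)_z(t - z_0) + \cdots, \\ 0 &= f_3 + (f_3)_x(r - x_0) + (f_3)_y(s - y_0) + (f_3)_z(t - z_0) + \cdots, \\ &\;\;\vdots \\ 0 &= f_n + (f_n)_x(r - x_0) + (f_n)_y(s - y_0) + (f_n)_z(t - z_0) + \cdots. \end{aligned}$$

In these equations, each function is evaluated at the approximate root $(x_0, y_0, z_0, \ldots)$.
Note that we have reduced the problem from solving a set of n nonlinear equations to
solving a set of n linear equations. The unknowns are the improvements in each estimated
variable $(r - x_0)$, $(s - y_0)$, $(t - z_0)$, ....

Then Eq. (2.33) becomes

$$\begin{bmatrix} (f_1)_x & (f_1)_y & (f_1)_z & \cdots \\ (f_2)_x & (f_2)_y & (f_2)_z & \cdots \\ (f_3)_x & (f_3)_y & (f_3)_z & \cdots \\ \vdots & & & \\ (f_n)_x & (f_n)_y & (f_n)_z & \cdots \end{bmatrix} \begin{bmatrix} \Delta x_i \\ \Delta y_i \\ \Delta z_i \\ \vdots \end{bmatrix} = -\begin{bmatrix} f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_n \end{bmatrix}, \qquad (2.37)$$

evaluated at $(x_i, y_i, z_i, \ldots)$. Solving this, we compute

$$x_{i+1} = x_i + \Delta x_i, \qquad y_{i+1} = y_i + \Delta y_i, \qquad z_{i+1} = z_i + \Delta z_i, \qquad \ldots.$$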


In a computer program, it is awkward to introduce each of the partial-derivative
functions (which must often be developed by hand, unless one has access to a program
like MACSYMA) to use in Eqs. (2.37). An alternative technique is to approximate these
partials by recalculating the function with a small perturbation to each of the variables
in turn:

$$\begin{aligned} (f_i)_x &\approx \frac{f_i(x + \delta, y, z, \ldots) - f_i(x, y, z, \ldots)}{\delta}, \\ (f_i)_y &\approx \frac{f_i(x, y + \delta, z, \ldots) - f_i(x, y, z, \ldots)}{\delta}, \\ (f_i)_z &\approx \frac{f_i(x, y, z + \delta, \ldots) - f_i(x, y, z, \ldots)}{\delta}. \end{aligned} \qquad (2.39)$$

Similar relations are used for each variable in each function.* A computer program in 
the next section exploits this idea. 
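The difference quotients of Eq. (2.39) can be sketched as follows. This is a hedged Python illustration rather than the book's routine; the name `fd_jacobian`, the default `delta`, and the list-of-functions interface are assumptions. Each function is evaluated once at the base point and once per perturbed variable, which is where the count of $n^2 + n$ evaluations per step comes from.

```python
import math

def fd_jacobian(funcs, x, delta=1e-6):
    """Approximate the matrix of partials by forward difference quotients."""
    n = len(x)
    base = [f(x) for f in funcs]            # n evaluations at the base point
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        xp = list(x)
        xp[j] += delta                      # perturb one variable at a time
        for i, f in enumerate(funcs):       # n more evaluations per variable
            jac[i][j] = (f(xp) - base[i]) / delta
    return base, jac

def f1(v):
    return 4.0 - v[0] ** 2 - v[1] ** 2

def f2(v):
    return 1.0 - math.exp(v[0]) - v[1]

values, jac = fd_jacobian([f1, f2], [1.0, -1.7])
for row in jac:
    print([round(entry, 4) for entry in row])
```

At (1, -1.7) the approximated partials agree with the analytic values used in Eq. (2.35) to about four decimal places.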

* Approximation of derivatives by such difference quotients is discussed in Chapter 4. If the limiting value (as
$\delta \to 0$) of this ratio were used, we would have exactly the definition of a derivative. Since limits are not used,
we have approximate values.

It is interesting to observe that Newton's method, as applied to a set of nonlinear
equations, reduces the problem to solving a set of linear equations in order to determine
the values that improve the accuracy of the estimates. This points out quite dramatically
how linear and nonlinear problems vary in difficulty.

Newton's method has the advantage of converging quadratically, at least when we
are near a root, but it is expensive in terms of function evaluations. For the 2 × 2 system
above, there are six function evaluations at each step, while for a 3 × 3 system, there
are twelve. For n simultaneous equations, the number of function evaluations is $n^2 + n$.
One can see why this is rarely applied to large systems. It is good strategy, in all cases
of simultaneous nonlinear equations, to reduce the number of equations as much as
possible by solving for one variable in terms of the others and eliminating that one by
substituting for it in the other equations. For example, we should attack the previous
example as follows:


Solve for y: $y = 1 - e^x$.
Substituting in the first equation, we get

\[ 4 - x^2 - (1 - e^x)^2 = 0, \]
\[ 3 - x^2 + 2e^x - e^{2x} = 0. \]

We then use the methods of Chapter 1.
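As a sketch of this reduction, the single equation can be handed to any one-variable root finder. The Python below assumes Newton's method for one equation (one possible choice among the single-equation methods) with a starting guess of 1.0; the names `h`, `dh`, and `newton_scalar` are illustrative only.

```python
import math

def h(x):
    """Reduced equation: 3 - x^2 + 2e^x - e^(2x)."""
    return 3.0 - x * x + 2.0 * math.exp(x) - math.exp(2.0 * x)

def dh(x):
    """Derivative of h with respect to x."""
    return -2.0 * x + 2.0 * math.exp(x) - 2.0 * math.exp(2.0 * x)

def newton_scalar(x, tol=1e-12, max_iter=50):
    for _ in range(max_iter):
        step = h(x) / dh(x)
        x -= step
        if abs(step) < tol:
            break
    return x

x = newton_scalar(1.0)
y = 1.0 - math.exp(x)        # recover the eliminated variable
print(f"x = {x:.6f}, y = {y:.6f}")
```

Only one function and one derivative are evaluated per step, compared with six function evaluations per step for the 2 × 2 system with numerical partials.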

When we must solve a larger system of nonlinear equations, a modification of
Newton's method is often used. It converges less than quadratically but usually faster
than linearly. Unfortunately, it may diverge unless we start fairly near to a root. In this
method we do not recompute the matrix of partials at each step. Rather we use the same
matrix for several steps before recomputing it. For a system of n equations we
would recompute the matrix after every n steps. In this way we need only n function
evaluations at each step, except when we occasionally have to update the matrix of
partials, which then adds an additional $n^2$ function evaluations.
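A minimal Python sketch of this modified method, applied to the system $4 - x^2 - y^2 = 0$, $1 - e^x - y = 0$ used earlier, is given below. It is an illustration under stated assumptions: analytic partials are used, the 2 × 2 inverse is formed explicitly, and the names `modified_newton` and `jacobian_inverse` are hypothetical.

```python
import math

def F(x, y):
    return 4.0 - x * x - y * y, 1.0 - math.exp(x) - y

def jacobian_inverse(x, y):
    """Inverse of the 2 x 2 matrix of partials, returned as (a, b, c, d)."""
    fx, fy = -2.0 * x, -2.0 * y
    gx, gy = -math.exp(x), -1.0
    det = fx * gy - fy * gx
    return gy / det, -fy / det, -gx / det, fx / det

def modified_newton(x, y, n=2, tol=1e-10, max_iter=100):
    inv = None
    for k in range(max_iter):
        if k % n == 0:                 # refresh the matrix only every n steps
            inv = jacobian_inverse(x, y)
        f, g = F(x, y)
        if abs(f) < tol and abs(g) < tol:
            return x, y, k
        a, b, c, d = inv
        x -= a * f + b * g             # correction is -(inverse) times (f, g)
        y -= c * f + d * g
    return x, y, max_iter

x, y, steps = modified_newton(1.0, -1.7)
print(f"x = {x:.6f}, y = {y:.6f} after {steps} steps")
```

Between refreshes each step costs only the n function evaluations, at the price of slower convergence than full Newton.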

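We first illustrate the method on the same 2 × 2 system above. Reworking Eq. (2.35)
produces

\[ \begin{bmatrix} \Delta x_0 \\ \Delta y_0 \end{bmatrix}
   = -\begin{bmatrix} -2 & 3.4 \\ -2.7183 & -1.0 \end{bmatrix}^{-1}
      \begin{bmatrix} 0.110 \\ -0.0183 \end{bmatrix}
   = \begin{bmatrix} 0.0043 \\ -0.0298 \end{bmatrix}, \]

\[ \begin{bmatrix} x_1 \\ y_1 \end{bmatrix}
   = \begin{bmatrix} 1.0 \\ -1.7 \end{bmatrix} + \begin{bmatrix} 0.0043 \\ -0.0298 \end{bmatrix}
   = \begin{bmatrix} 1.0043 \\ -1.7298 \end{bmatrix}; \]

\[ \begin{bmatrix} \Delta x_1 \\ \Delta y_1 \end{bmatrix}
   = -\begin{bmatrix} -2 & 3.4 \\ -2.7183 & -1.0 \end{bmatrix}^{-1}
      \begin{bmatrix} f(x_1, y_1) \\ g(x_1, y_1) \end{bmatrix}
   = \begin{bmatrix} -0.000133 \\ 0.000165 \end{bmatrix}, \]

\[ x_2 = 1.004167, \qquad y_2 = -1.729635; \]
\[ f(x_2, y_2) = -0.000011, \qquad g(x_2, y_2) = -0.000002. \]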


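In addition we consider a different example. Here we have

\[ e^x - y = 0, \]
\[ xy - e^x = 0; \]

Eq. (2.33) for this system becomes

\[ \begin{bmatrix} e^{x_i} & -1 \\ y_i - e^{x_i} & x_i \end{bmatrix}
   \begin{bmatrix} \Delta x_i \\ \Delta y_i \end{bmatrix}
   = -\begin{bmatrix} f(x_i, y_i) \\ g(x_i, y_i) \end{bmatrix}. \]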


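Let $x_0 = 0.95$, $y_0 = 2.7$. Then solving the preceding with the inverse matrix

\[ \begin{bmatrix} \Delta x_0 \\ \Delta y_0 \end{bmatrix}
   = -\begin{bmatrix} 2.5857 & -1 \\ 0.1143 & 0.95 \end{bmatrix}^{-1}
      \begin{bmatrix} -0.1143 \\ -0.0207 \end{bmatrix}, \]

we get $x_1 = 1.00029$, $y_1 = 2.71575$;

\[ \begin{bmatrix} \Delta x_1 \\ \Delta y_1 \end{bmatrix}
   = -\begin{bmatrix} 2.5857 & -1 \\ 0.1143 & 0.95 \end{bmatrix}^{-1}
      \begin{bmatrix} 0.003325 \\ -0.002533 \end{bmatrix}, \]

we get $x_2 = 1.000048$, $y_2 = 2.718445$.

Since n = 2, we update our matrix of partials so that

\[ \begin{bmatrix} \Delta x_2 \\ \Delta y_2 \end{bmatrix}
   = -\begin{bmatrix} 2.7184 & -1 \\ 0.00003 & 1.0000 \end{bmatrix}^{-1}
      \begin{bmatrix} -0.000033 \\ 0.000163 \end{bmatrix}, \]

we get $x_3 = 1.000000$, $y_3 = 2.718282$.

This agrees with the exact solution: (1, e).


2.13 CHAPTER SUMMARY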


To test your understanding of this chapter, ask yourself if you can do the following. If 
you cannot, restudy is indicated. 


1. Manipulate matrices and express a set of linear equations in matrix form. You should
be able to expand the matrix representation into the standard form for a set of
simultaneous equations. You should know the definitions of vector, scalar, identity
matrix, triangular matrix, transpose, determinant, singular, and inverse as they
relate to matrices.

2. Solve a system of linear equations by row transformations, utilizing pivoting, and
explain why pivoting is sometimes required and always helpful.

3. Tell how Gaussian elimination differs from the Gauss-Jordan method and explain
why the former is generally preferred.

4. Rewrite a matrix as the product of LU pairs in several ways and create an LU matrix
pair that implements Gauss elimination. You can apply this last to solve a system
with multiple right-hand sides.

5. Explain what is meant by scaling during the solution process and why inaccurate
solutions may result if no scaling is done.

6. Create examples of systems that have no solution for two different reasons and show
that the matrix is singular in one case. You should be able to determine if a given
matrix is singular or nearly so.

7. Find the determinant of a matrix efficiently.

8. Find the inverse of a matrix and use it to solve a system of equations. You should
be able to explain why inverses are important even though they are not usually the
preferred method of solving a system.

9. List the properties that all norms must have and compute the various norms of
matrices and vectors. You should know which norm gives the tightest bound to the
magnitude of a matrix.

10. Explain what the condition number of a matrix is and how it is involved in estimating
the accuracy of the solution. You should be able to use iterative improvement to
reduce the error of a solution.

11. Solve a system of equations by two iterative methods and explain why one usually
converges faster. You should be able to explain when iteration is preferred over the
direct methods.

12. Describe the method of relaxation and use overrelaxation to speed up the convergence
of iteration.

13. Solve systems of nonlinear equations by iteration and by Newton's method; in the
latter, explain how the number of function evaluations can be reduced and what
sacrifice this entails.

14. Utilize subroutines for the solution of sets of equations, writing driver programs to
interface with them.

15. Read, understand, and use subroutines for linear systems from one or more standard
libraries.


SELECTED READINGS FOR CHAPTER 2


Forsythe and Moler (1967); Hageman and Young (1981); Stewart (1973).


2.14 COMPUTER PROGRAMS


Before we consider programs that apply to this chapter, it should be noted that there are
several subroutines in IMSL (Appendix C) for solving a system of equations. For a system
of nonlinear equations there is ZSPOW, which uses Newton's method. For a system of
linear equations, LEQT2F is very easy to use. However, if one wishes to study a
well-written program in this area, there are DECOMP and SOLVE in Forsythe, Malcolm, and
Moler (1977). These do an LU decomposition and give an approximation of the condition
number of the matrix of coefficients. For iterative methods in solving $Ax = b$ there is a
package of subroutines called ITPACK developed at the University of Texas.

A number of examples of computer programs utilizing the methods of this chapter
are exhibited. (These are grouped near the end of the chapter, as Figs. 2.6 through 2.13.)
The first two employ the Gaussian elimination method: Program 1 for a general n × n
array of coefficients and Program 2 for a tridiagonal system. The latter embodies an
ingenious scheme to compress the matrix so that less storage is used by it for 500 equations
than is required by the former for only 40. Programs 3, 4, and 5 are a combination of
routines that form the LU equivalent of the coefficient matrix and then use it to solve the
set of equations with one or several right-hand sides. Program 6 is a subroutine that can
be used to scale the coefficient matrix when values that differ widely in magnitude are
involved. Program 7 uses iteration to solve a set of equations; they must have the diagonal
coefficients dominant. The last program solves n simultaneous nonlinear equations by
Newton's method.
The first Gaussian elimination program proceeds quite conventionally, utilizing the
straightforward method that we discussed earlier in the chapter (see Section 2.4). To use
subroutine LUD (Program 1, Fig. 2.6) the coefficient matrix is augmented with the constant
vector and passed to the subprogram. The set of equations can be solved with multiple
right-hand sides. The coefficient matrix is made upper-triangular (although the elements
below the diagonal are not converted to zeros because we know that the row reduction
would create a zero; hence it need not actually be done). In order to minimize round-off
error as well as to avoid division by zero, we scan each column to find the element of
largest magnitude on or below the diagonal, and interchange rows to put this largest
element on the diagonal. In other words, partial pivoting is performed. If the largest
element is very small (less than $10^{-5}$), the program terminates with a message, concluding
that the matrix is either singular or nearly so. In the case of a small divisor, the accuracy
will be poor. (It is, of course, true that if all the coefficients are very small in magnitude,
this is not the correct conclusion. It is assumed that the values have been scaled so that
they are not all small.)

(The pair of subprograms below, LUDCMP and SOLVLU, in Programs 3 and 5, can 
also be used to solve equations with multiple right-hand sides. These subroutines also 
can solve with a new right-hand-side vector at any time after LUDCMP has reduced the 
coefficient matrix, without repeating the reduction of the coefficient matrix.) 

Subroutine TRIDG (Program 2) in Fig. 2.7 solves a system with a tridiagonal
coefficient matrix. To conserve storage in the computer, the tridiagonal matrix of coefficients
is compressed, together with the constant terms, into an N × 4 matrix. Column one holds
the coefficients to the left of the diagonal, column two holds the diagonal terms, column
three holds the coefficient to the right of the diagonal, and column four holds the constant
terms. The ith and (i − 1)st rows of the compressed matrix correspond to elements in
the uncompressed matrix, as shown:

\[ \begin{array}{cccc|c}
a_{i-1,1} & a_{i-1,2} & a_{i-1,3} &         & a_{i-1,4} \\
          & a_{i,1}   & a_{i,2}   & a_{i,3} & a_{i,4}
\end{array} \]

We can eliminate $a_{i,1}$ by subtracting $a_{i,1}/a_{i-1,2}$ times the (i − 1)st row from the ith row.
Since we know a zero will replace $a_{i,1}$, we do not actually need to perform the arithmetic.
The values of $a_{i,2}$ and $a_{i,4}$ change as follows:

\[ a_{i,2} \leftarrow a_{i,2} - \frac{a_{i,1}}{a_{i-1,2}}\, a_{i-1,3}, \qquad
   a_{i,4} \leftarrow a_{i,4} - \frac{a_{i,1}}{a_{i-1,2}}\, a_{i-1,4}. \]

When the ith row is reduced, $a_{i,3}$ is unaffected because there is a zero above it. Elements
$a_{1,1}$ and $a_{n,3}$ are never referred to and their values can be anything, or they can be left
undefined.
After reduction, a back-substitution is performed. The elements of the solution vector
replace the constant vector in the fourth column of the matrix.
The equations for back-substitution are

\[ a_{n,4} \leftarrow \frac{a_{n,4}}{a_{n,2}}, \]

\[ a_{i,4} \leftarrow \frac{a_{i,4} - a_{i,3}\, a_{i+1,4}}{a_{i,2}}, \qquad i = n-1, n-2, \ldots, 1. \]
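The reduction and back-substitution on the compressed N × 4 array can be sketched in Python as below. This is an illustrative translation, not the Fortran of Fig. 2.7; the list-of-rows layout `[left, diagonal, right, constant]` and the name `tridg` are assumptions. The sample rows are those shown as input in the Program 2 output listing.

```python
def tridg(a):
    """Solve a tridiagonal system stored as rows [left, diag, right, rhs].

    The array is modified in place; the solution ends up in column four.
    """
    n = len(a)
    # Eliminate the below-diagonal terms, keeping each multiplier in column one.
    for i in range(1, n):
        a[i][0] = a[i][0] / a[i - 1][1]
        a[i][1] -= a[i][0] * a[i - 1][2]
        a[i][3] -= a[i][0] * a[i - 1][3]
    # Back-substitution, from the last row upward.
    a[n - 1][3] /= a[n - 1][1]
    for i in range(n - 2, -1, -1):
        a[i][3] = (a[i][3] - a[i][2] * a[i + 1][3]) / a[i][1]
    return [row[3] for row in a]

rows = [
    [1.0, -4.0, 1.0, -5.0],
    [1.0, -4.0, 2.0, 11.0],
    [1.0, -4.0, 1.0, -11.0],
    [1.0, -4.0, 0.0, -5.0],
]
print(tridg(rows))
```

Note that `rows[0][0]` and `rows[-1][2]` are never read, matching the remark that those two elements may hold anything.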


The third, fourth, and fifth subroutines (Figs. 2.8 through 2.10) will ordinarily be 
used together to solve a linear system. The first, LUDCMP, accepts the square matrix of 
coefficients and forms the LU decomposition that is equivalent to the coefficient matrix. 
This is returned in place of the coefficient matrix by compacting the L and U matrices 
as described in Section 2.5. The algorithm is a straightforward implementation of the 
formulas of Section 2.5. In this routine, partial pivoting is employed by a call to Program 
4. This APVT subroutine does partial pivoting after a column of L’s has been obtained. 

The order of the rows, as affected by pivoting, is recorded in a vector named ORDER, 
so that the elements of the b vector can be rearranged into the same order as the coefficient 
matrix. A very small diagonal element is signaled by printing a message. 

After the LU decomposition has been obtained from LUDCMP, the solution to the 
system with a given right-hand-side vector is obtained by calling the SOLVLU subroutine, 
Program 5. In this program, the L-values are used to reduce the elements of the constant 
terms in the same way that the coefficient matrix was triangularized. The subroutine then 
does a back-substitution, using the reduced coefficients in the U matrix, and returns the 
solution vector to the caller. 

Additional calls to SOLVLU will give the solutions corresponding to new right-hand 
sides with economy of computational effort, since the LU decomposition is done only 
once. This scheme is recommended when a given set of equations must be solved with 
a series of constant terms. 
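The economy of reusing one decomposition for several right-hand sides can be sketched in Python. This is a hedged illustration, not LUDCMP and SOLVLU themselves: it assumes the variant with ones on the diagonal of L (as in subroutine LUD) rather than on U, and the names `lu_decompose` and `lu_solve` are hypothetical. The matrix is the one used in Program 1.

```python
def lu_decompose(a, tiny=1e-5):
    """Return (lu, order): compact LU factors and the row order after pivoting."""
    n = len(a)
    lu = [row[:] for row in a]
    order = list(range(n))
    for i in range(n):
        # Partial pivoting: largest magnitude on or below the diagonal.
        p = max(range(i, n), key=lambda r: abs(lu[r][i]))
        if abs(lu[p][i]) < tiny:
            raise ValueError("matrix is singular or nearly singular")
        if p != i:
            lu[i], lu[p] = lu[p], lu[i]
            order[i], order[p] = order[p], order[i]
        for r in range(i + 1, n):
            lu[r][i] /= lu[i][i]                 # multiplier stored in place of the zero
            for c in range(i + 1, n):
                lu[r][c] -= lu[r][i] * lu[i][c]
    return lu, order

def lu_solve(lu, order, b):
    """Solve for one right-hand side using factors from lu_decompose."""
    n = len(lu)
    x = [b[order[i]] for i in range(n)]          # reorder b like the rows of A
    for i in range(1, n):                        # forward substitution
        x[i] -= sum(lu[i][j] * x[j] for j in range(i))
    for i in range(n - 1, -1, -1):               # back substitution
        x[i] = (x[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))) / lu[i][i]
    return x

A = [[6.0, 1.0, -6.0, -5.0],
     [2.0, 2.0, 3.0, 2.0],
     [4.0, -3.0, 0.0, 1.0],
     [0.0, 2.0, 0.0, 1.0]]
lu, order = lu_decompose(A)                      # done once
print(lu_solve(lu, order, [6.0, -2.0, -7.0, 0.0]))
print(lu_solve(lu, order, [-4.0, 9.0, 2.0, 3.0]))  # a second right-hand side
```

The second call costs only the two substitution sweeps; the elimination work is not repeated.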

Program 6 (Fig. 2.11) is a utility subroutine that scales the values of the coefficient 
matrix so that the largest magnitude in each row is unity. This subroutine should be called 
before any of the previous routines are used if the coefficient matrix has values that are 
widely different in some rows than in other rows, as discussed in Section 2.3. 
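Row scaling of this kind can be sketched briefly in Python. The function name `scale_rows` and the choice to scale the right-hand side in the same call are assumptions made for this illustration; the two-equation sample data are hypothetical.

```python
def scale_rows(a, b):
    """Scale each row so its largest coefficient magnitude is one.

    Returns the scaled matrix, the scaled right-hand side, and the factors.
    """
    factors = []
    for i, row in enumerate(a):
        big = max(abs(v) for v in row)
        if big == 0.0:
            raise ValueError(f"all elements in row {i} are zero")
        factors.append(1.0 / big)
    a_scaled = [[v * s for v in row] for row, s in zip(a, factors)]
    b_scaled = [v * s for v, s in zip(b, factors)]   # same factor for each equation
    return a_scaled, b_scaled, factors

a_s, b_s, fac = scale_rows([[2.0, 100000.0], [1.0, 1.0]], [100000.0, 2.0])
print(a_s, b_s)
```

Because each equation is multiplied through by a single factor, the solution of the scaled system is the same as that of the original one.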

Program 7 (Fig. 2.12) is a subroutine named GSITRN, which finds the solution of
a linear system by iteration. To make the convergence as rapid as possible, the coefficients
must be arranged so that the terms of largest magnitude are on the diagonal. In some
systems this cannot be totally achieved; in that case one should approach the condition
as closely as possible. An initial approximation to the solution is passed to the subroutine
as a parameter.

The first step in the subroutine is to divide each row by the diagonal term, including
the corresponding element in the right-hand-side vector; this avoids later divisions.
Iteration continues until each component of the approximate solution vector changes less
than the input parameter TOL. If this does not occur within the number of iterations
specified by NITER, the program prints a message and returns the last approximation.
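The steps just described can be sketched in Python as follows. This is an illustration, not the GSITRN listing: it returns a convergence flag instead of printing a message, and the name `gauss_seidel` and the 3 × 3 diagonally dominant sample system are assumptions.

```python
def gauss_seidel(a, b, x0, tol=1e-8, niter=100):
    """Gauss-Seidel iteration; returns (x, iterations used, converged flag)."""
    n = len(b)
    # Divide each row and its right-hand side by the diagonal term once.
    rows = [[a[i][j] / a[i][i] for j in range(n)] for i in range(n)]
    rhs = [b[i] / a[i][i] for i in range(n)]
    x = list(x0)
    for it in range(1, niter + 1):
        xmax = 0.0
        for i in range(n):
            old = x[i]
            # New values are used as soon as they are available.
            x[i] = rhs[i] - sum(rows[i][j] * x[j] for j in range(n) if j != i)
            xmax = max(xmax, abs(x[i] - old))
        if xmax <= tol:
            return x, it, True
    return x, niter, False

a = [[8.0, 1.0, -1.0],
     [1.0, -7.0, 2.0],
     [2.0, 1.0, 9.0]]
b = [8.0, -4.0, 12.0]
x, iterations, converged = gauss_seidel(a, b, [0.0, 0.0, 0.0])
print(x, iterations, converged)
```

The sample system has the solution (1, 1, 1); with dominant diagonal terms the largest change per sweep shrinks quickly.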
Iteration is best adapted to sparse matrices.* The subroutine GSITRN does not
illustrate this case. In fact, writing a general-purpose subroutine for iterative solutions to
a sparse matrix is impossible, for this would require a repetitive pattern in the coefficients;
such patterns cannot be anticipated. Many texts illustrate iteration with a tridiagonal
system, but for these, iteration is not the best choice. A program like TRIDG, just
discussed, has a minimum storage requirement and runs much faster than an iterative
scheme.

Program 8 (Fig. 2.13), named NLSYST, is a subroutine that solves a nonlinear
system. Newton's method (see Eqs. 2.37) is employed. The partial derivatives can
be computed in either of two ways: numerically or analytically (for the former,
METHOD = 0; for the latter, METHOD = 1). In the former case we check the ratio of
a change in the function values for small changes in the independent variable. This
estimation of the derivative by a difference quotient will be studied in greater detail in
Chapter 4. The computation of the partials by either method is done in subroutine FCNJ.
After all the partials have been approximated, the set of equations equivalent to Eq.
(2.37) is solved by a call to subroutines LUD and SOLVE (Program 1), which give values
for increments to the initial guesses for the variables that should make them closer to a
solution of the nonlinear system. The subroutine repeats the procedure until convergence
is obtained.

The calling program passes initial guesses for the variables, a value for DELTA (the 
change in variable value used to estimate the partial derivative), and tolerance values 
(FTOL and XTOL) to terminate iterations when the function values are all sufficiently 
small or the changes in the x-values are all sufficiently small. The maximum number of 
iterations is limited by MAXIT. Function values are supplied through an external 
subroutine. 

Some judgment is needed to use NLSYST. First, the initial guesses for values of the 
variables must be near enough to a solution to give convergence. Second, the value for 
DELTA should be small enough to give a reasonable approximation to the partial derivative 
but not so small as to lead to excessive round-off. DELTA = 0.01 seems to give good 
results. 

Calling the subroutines of Programs 1 to 7 is straightforward, but a driver for Program
8 is more complex.

As an illustration of how the subroutine NLSYST is used, the program in Fig. 2.5,
together with the function subroutine, will determine a root of the pair of equations

\[ f(x, y) = x^2 + y^2 - 5 = 0, \]
\[ g(x, y) = y - e^x - 1 = 0. \]

The starting value (2, 4) is supplied, along with the values for DELTA, FTOL, XTOL,
and MAXIT. After six iterations, the root at (0.20437, 2.22672) was found.
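The overall flow of such a driver and solver can be sketched in Python. This is a hedged stand-in, not a translation of NLSYST: the names `nlsyst`, `fcn`, and `solve_linear` and the keyword defaults are assumptions, and only numerical partials are shown. Being a double-precision sketch, its iteration count and last digits need not match the book's run exactly.

```python
import math

def fcn(v):
    """Function values for x^2 + y^2 - 5 = 0 and y - e^x - 1 = 0."""
    x, y = v
    return [x * x + y * y - 5.0, y - math.exp(x) - 1.0]

def solve_linear(a, b):
    """Gaussian elimination with partial pivoting on an augmented copy."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, n):
            ratio = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= ratio * m[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def nlsyst(fcn, x, delta=0.125e-4, xtol=1e-9, ftol=1e-9, maxit=50):
    """Newton's method with partials estimated by difference quotients."""
    x = list(x)
    n = len(x)
    for it in range(1, maxit + 1):
        f = fcn(x)
        if max(abs(v) for v in f) <= ftol:
            return x, it - 1
        jac = [[0.0] * n for _ in range(n)]
        for j in range(n):
            xp = list(x)
            xp[j] += delta
            fp = fcn(xp)
            for i in range(n):
                jac[i][j] = (fp[i] - f[i]) / delta
        dx = solve_linear(jac, [-v for v in f])
        x = [xi + di for xi, di in zip(x, dx)]
        if max(abs(d) for d in dx) <= xtol:
            return x, it
    return x, maxit

root, iterations = nlsyst(fcn, [2.0, 4.0])
print(root, iterations)
```

The two stopping tests mirror the roles of FTOL and XTOL: iteration ends when the function values or the changes in the variables are all sufficiently small.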


* A major area of application is in solving certain partial-differential equations. This is discussed in later chapters.
      PROGRAM NLSYS(INPUT,OUTPUT)
C
C     THIS PROGRAM SOLVES THE EXAMPLE IN SECTION 2.12
C     USING EITHER NUMERICAL PARTIALS FOR THE JACOBIAN
C     MATRIX (METHOD = 0) OR ANALYTIC PARTIALS
C     (METHOD = 1). IN THE LATTER CASE THE MATRIX HAS
C     BEEN PROVIDED IN SUBROUTINE FCNJ.
C
      REAL X(2),F(2)
      INTEGER N
      EXTERNAL FCN
      DATA DELTA,XTOL,FTOL,I/0.125E-4,2*0.1E-8,0/
      DATA MAXIT/50/
      DATA N,METHOD/2,0/
      DATA X/2.0,4.0/
C
      CALL NLSYST(FCN,N,MAXIT,X,F,DELTA,XTOL,FTOL,METHOD,
      PRINT 100
  100 FORMAT(///'THE X-VALUES ARE: '/)
      PRINT 200, (X(I), I = 1,N)
  200 FORMAT(T6,F8.4/)
      STOP
      END


Figure 2.5


Figure 2.6 Program 1.


      PROGRAM PDEC (INPUT,OUTPUT)
C
C     THIS PROGRAM SOLVES A SYSTEM OF EQUATIONS USING GAUSSIAN
C     ELIMINATION WITH PARTIAL PIVOTING AND BACK SUBSTITUTION ACCORDING
C     TO THE ALGORITHM GIVEN IN SECTION 2.4.
C     THE ALGORITHM IS IMPLEMENTED IN TWO PROCEDURES:
C     LUD   : RETURNS THE LU DECOMPOSITION OF THE COEFFICIENT MATRIX
C     SOLVE : SOLVE THE RIGHT HAND SIDES USING BOTH FORWARD AND
C             BACK SUBSTITUTION
C
      REAL A(10,10), B(10), DET
      INTEGER IPVT(10), N, NDIM
C
      DATA N,NDIM/4,10/
C
C     THE MATRIX A IS INPUT VIA A DATA STATEMENT COLUMN BY COLUMN,
C     SINCE FORTRAN IS COLUMN-MAJOR.
C
      DATA A/6,2,4,7*0,1,2,-3,2,6*0,-6,3,8*0,
     +       -5,2,1,1,6*0,60*0/
      DATA B/6,-2,-7,7*0/
C
C     PRINT OUT ORIGINAL MATRIX
C
      PRINT '(//)'
      PRINT 100
  100 FORMAT (17X,'THE MATRIX A ',13X,' RIGHT HAND SIDE'/)
      DO 10 I = 1,N
         PRINT 200, (A(I,J), J = 1,N), B(I)
   10 CONTINUE
  200 FORMAT (4F10.4,F15.4)
C
C     WE FIND THE LU DECOMPOSITION OF THE MATRIX A
C
      CALL LUD (A,N,IPVT,NDIM,DET)
C
C     PROCEDURE SOLVE NOW GETS SOLUTION BY FORWARD/BACK SUBSTITUTION
C
      CALL SOLVE (A,N,IPVT,B,NDIM)
C
C     PRINT LU MATRIX AND SOLUTION VECTOR
C
      PRINT '(//)'
      PRINT 300
  300 FORMAT (10X,'THE LU DECOMPOSITION OF A ',8X,' SOLUTION VECTOR '/)
      DO 20 I = 1,N
         PRINT 200, (A(I,J), J = 1,N), B(I)
   20 CONTINUE
C
      STOP
      END

      SUBROUTINE LUD(A,N,IPVT,NDIM,DET)
C
C     SUBROUTINE LUD: THIS SUBROUTINE SOLVES A SET OF LINEAR EQUATIONS
C     AND RETURNS THE LU DECOMPOSITION OF THE COEFFICIENT MATRIX. THE
C     METHOD IS BASED ON THE ALGORITHM PRESENTED IN SECTION 2.4.
C
C     INPUT:  A    - THE COEFFICIENT MATRIX
C             N    - THE NUMBER OF EQUATIONS
C             NDIM - THE MAXIMUM ROW DIMENSION OF A
C     OUTPUT: A    - THE LU DECOMPOSITION OF THE MATRIX A
C                    THE ORIGINAL MATRIX A IS LOST
C             IPVT - A VECTOR CONTAINING THE ORDER OF THE ROWS OF THE
C                    REARRANGED MATRIX DUE TO PIVOTING
C             DET  - THE DETERMINANT OF THE MATRIX. IT IS SET TO
C                    0 IF ANY PIVOT ELEMENT IS LESS THAN 0.00001.
C
      REAL A(NDIM,N), SAVE, RATIO, VALUE, DET
      INTEGER IPVT(N),N,NDIM,I,IPVTMT,NLESS1,IPLUS1,J,L
      INTEGER KCOL,JCOL,JROW,TMPVT
C
      DET = 1.0
      NLESS1 = N - 1
      DO 10 I = 1,N
         IPVT(I) = I
   10 CONTINUE
C
      DO 20 I = 1,NLESS1
         IPLUS1 = I + 1
         IPVTMT = I
C
C        FIND PIVOT ROW
C
         DO 30 J = IPLUS1,N
            IF (ABS(A(IPVTMT,I)) .LT. ABS(A(J,I))) IPVTMT = J
   30    CONTINUE
C
         IF (ABS(A(IPVTMT,I)) .LT. 1.0E-05) THEN
            DET = 0.0
            PRINT '(//)'
            PRINT *, ' MATRIX IS SINGULAR OR NEAR SINGULAR '
            PRINT '(//)'
            RETURN
         END IF
C
C        INTERCHANGE ROWS IF NECESSARY
C
         IF (IPVTMT .NE. I) THEN
            TMPVT = IPVT(I)
            IPVT(I) = IPVT(IPVTMT)
            DO 40 JCOL = 1,N
               SAVE = A(I,JCOL)
               A(I,JCOL) = A(IPVTMT,JCOL)
               A(IPVTMT,JCOL) = SAVE
   40       CONTINUE
            IPVT(IPVTMT) = TMPVT
            DET = -DET
         END IF
C
C        REDUCE ALL ELEMENTS BELOW THE I'TH ROW
C
         DO 50 JROW = IPLUS1,N
            IF (A(JROW,I) .NE. 0.0) THEN
               A(JROW,I) = A(JROW,I)/A(I,I)
               DO 60 KCOL = IPLUS1,N
                  A(JROW,KCOL) = A(JROW,KCOL) - A(JROW,I)*A(I,KCOL)
   60          CONTINUE
            END IF
   50    CONTINUE
   20 CONTINUE
C
      IF (ABS(A(N,N)) .LT. 1.0E-5) THEN
         PRINT '(//)'
         PRINT *, ' MATRIX IS SINGULAR OR NEAR SINGULAR '
         PRINT '(//)'
         DET = 0.0
         RETURN
      END IF
C
C     COMPUTE THE DETERMINANT OF THE MATRIX
C
      DO 70 I = 1,N
         DET = DET * A(I,I)
   70 CONTINUE
C
      RETURN
      END


      SUBROUTINE SOLVE (A,N,IPVT,B,NDIM)
C
C     MAKING USE OF THE LU DECOMPOSITION OF THE MATRIX A, THIS
C     SUBROUTINE SOLVES THE SYSTEM BY FORWARD AND BACK SUBSTITUTION.
C
C     INPUT:  A    - LU MATRIX FROM SUBROUTINE LUD
C             N    - NUMBER OF EQUATIONS
C             IPVT - A RECORD OF THE REARRANGEMENT OF THE ROWS
C                    OF A FROM SUBROUTINE LUD
C             B    - RIGHT HAND SIDE OF THE SYSTEM OF EQUATIONS
C
C     OUTPUT: B    - THE SOLUTION VECTOR
C
      REAL A(NDIM,N), B(N), X(10), SUM
      INTEGER IPVT(N),N,NDIM,IROW,JCOL,I
C
C     REARRANGE THE ELEMENTS OF THE B VECTOR. STORE THEM IN THE
C     X VECTOR.
C
      DO 10 I = 1,N
         X(I) = B(IPVT(I))
   10 CONTINUE
C
C     SOLVE USING FORWARD SUBSTITUTION--LY = B
C
      DO 20 IROW = 2,N
         SUM = X(IROW)
         DO 30 JCOL = 1,(IROW-1)
            SUM = SUM - A(IROW,JCOL)*X(JCOL)
   30    CONTINUE
         X(IROW) = SUM
   20 CONTINUE
C
C     SOLVE BY BACK SUBSTITUTION--UX = Y
C
      B(N) = X(N)/A(N,N)
      DO 40 IROW = (N-1),1,-1
         SUM = X(IROW)
         DO 50 JCOL = (IROW+1),N
            SUM = SUM - A(IROW,JCOL)*B(JCOL)
   50    CONTINUE
         B(IROW) = SUM/A(IROW,IROW)
   40 CONTINUE
C
      RETURN
      END
OUTPUT FOR PROGRAM 1


                 THE MATRIX A              RIGHT HAND SIDE

    6.0000    1.0000   -6.0000   -5.0000         6.0000
    2.0000    2.0000    3.0000    2.0000        -2.0000
    4.0000   -3.0000     .0000    1.0000        -7.0000
     .0000    2.0000     .0000    1.0000          .0000


          THE LU DECOMPOSITION OF A         SOLUTION VECTOR

    6.0000    1.0000   -6.0000   -5.0000         -.5000
     .6667   -3.6667    4.0000    4.3333         1.0000
     .3333    -.4545    6.8182    5.6364          .3333
     .0000    -.5455     .3200    1.5600        -2.0000


Figure 2.7 Program 2.


      PROGRAM PDIAG (INPUT,OUTPUT)
      REAL A(4,4)
      INTEGER N,I
      DATA N,A/4,4*1.0,4*-4.0,1.0,2.0,1.0,0.0,-5,11,-11,-5/
      PRINT 300
      DO 10 I = 1,N
         PRINT 100, (A(I,J), J = 1,4)
   10 CONTINUE
      NDIM = N
      CALL TRIDG(A,N,NDIM)
      PRINT 400
      DO 20 I = 1,N
         PRINT 100, (A(I,J), J = 1,4)
   20 CONTINUE
  100 FORMAT (T5,4(F10.4,3X))
  300 FORMAT(//T10,10('*'),' ON INPUT THE MATRIX A IS ',10('*')/)
  400 FORMAT(//T10,10('*'),' ON OUTPUT THE MATRIX IS ',10('*')/)
      STOP
      END


C     SUBROUTINE TRIDG :
C                        THIS SUBROUTINE PERFORMS GAUSSIAN ELIMINATION
C     ON A TRIDIAGONAL SYSTEM. THE COEFFICIENTS OF N EQUATIONS ARE
C     STORED IN THE N X 4 ARRAY, A. THE FIRST COLUMN OF A HOLDS THE
C     ELEMENTS TO THE LEFT OF THE DIAGONAL, THE SECOND HOLDS THE
C     DIAGONAL ELEMENTS, AND THE THIRD HOLDS THE ELEMENTS TO THE RIGHT.
C     THE FOURTH COLUMN HOLDS THE RIGHT HAND SIDE TERMS.
C
C     PARAMETERS ARE :
C
C     A    - MATRIX OF COEFFICIENTS AND R.H.S. AS DESCRIBED
C     N    - NUMBER OF EQUATIONS
C     NDIM - FIRST DIMENSION OF A IN THE CALLING PROGRAM
C
C     THE SOLUTION VECTOR IS RETURNED IN THE FOURTH COLUMN OF A.
C     THE FIRST THREE COLUMNS OF A ON OUTPUT WILL CONTAIN THE
C     LU DECOMPOSITION OF THE ORIGINAL TRIDIAGONAL MATRIX.
C
      SUBROUTINE TRIDG(A,N,NDIM)
      REAL A(NDIM,4)
      INTEGER N,NDIM,I,NM1,M
C
C     FIRST WE ELIMINATE ALL BELOW DIAGONAL TERMS
C
      DO 10 I = 2,N
         A(I,1) = A(I,1)/A(I-1,2)
         A(I,2) = A(I,2) - A(I,1)*A(I-1,3)
         A(I,4) = A(I,4) - A(I,1)*A(I-1,4)
   10 CONTINUE
C
C     NOW WE DO THE BACK SUBSTITUTING
C
      NM1 = N - 1
      A(N,4) = A(N,4) / A(N,2)
      DO 20 I = NM1,1,-1
C
C        THE INDEX M WILL COUNT UP THE ROWS
C
         A(I,4) = ( A(I,4) - A(I,3)*A(I+1,4) ) / A(I,2)
   20 CONTINUE
      RETURN
      END


OUTPUT FOR PROGRAM 2


         ********** ON INPUT THE MATRIX A IS **********

        1.0000      -4.0000       1.0000      -5.0000
        1.0000      -4.0000       2.0000      11.0000
        1.0000      -4.0000       1.0000     -11.0000
        1.0000      -4.0000       0.0000      -5.0000


         ********** ON OUTPUT THE MATRIX IS **********

        1.0000      -4.0000       1.0000       1.0000
        -.2500      -3.7500       2.0000      -1.0000
        -.2667      -3.4667       1.0000       3.0000
        -.2885      -3.7115       0.0000       2.0000


Figure 2.8 Program 3.


      SUBROUTINE LUDCMP (A,N,NDIM,ORDER)
C
C     SUBROUTINE LUDCMP :
C                         THIS SUBROUTINE COMPUTES THE L AND U
C     TRIANGULAR MATRICES EQUIVALENT TO THE A MATRIX, SUCH THAT
C     LU = A. THESE MATRICES ARE RETURNED IN THE SPACE OF A, IN COMPACT
C     FORM. THE U MATRIX HAS ONES ON ITS DIAGONAL. PARTIAL PIVOTING IS
C     USED TO GIVE MAXIMUM VALUED ELEMENTS ON THE DIAGONAL OF L. THE
C     ORDER OF THE ROWS AFTER PIVOTING IS RETURNED IN THE INTEGER
C     VECTOR, ORDER. THIS SHOULD BE USED TO REORDER R.H.S. VECTORS
C     BEFORE SOLVING THE SYSTEM, AX = B.
C
C     A     - MATRIX OF COEFFICIENTS
C     N     - NUMBER OF EQUATIONS
C     NDIM  - FIRST DIMENSION OF A IN THE CALLING PROGRAM
C     ORDER - INTEGER VECTOR HOLDING ROW ORDER AFTER PIVOTING
C
      REAL A(NDIM,N)
      INTEGER N,NDIM,ORDER(N)
      REAL SUM
      INTEGER I,KCOL,NM1,JM1,JCOL,JP1,IROW
C
C     ESTABLISH INITIAL ORDERING IN ORDER VECTOR
C
      DO 10 I = 1,N
         ORDER(I) = I
   10 CONTINUE
C
C     DO PIVOTING FOR FIRST COLUMN BY CALL TO SUBROUTINE APVT
C
      CALL APVT (A,N,NDIM,ORDER,1)
C
C     IF PIVOT ELEMENT VERY SMALL, PRINT ERROR MESSAGE AND RETURN
C
      IF ( ABS(A(1,1)) .LT. 1.0E-5 ) THEN
         PRINT 100
         RETURN
      END IF
C
C     NOW COMPUTE ELEMENTS FOR FIRST ROW OF U
C
      DO 20 KCOL = 2,N
         A(1,KCOL) = A(1,KCOL) / A(1,1)
   20 CONTINUE
C
C     COMPLETE THE COMPUTING OF L AND U ELEMENTS. THE GENERAL PLAN IS
C     TO COMPUTE A COLUMN OF L'S, THEN CALL APVT TO INTERCHANGE ROWS,
C     AND THEN GET A ROW OF U'S.
C
      NM1 = N - 1
      DO 80 JCOL = 2,NM1
C
C        FIRST COMPUTE A COLUMN OF L'S
C
         JM1 = JCOL - 1
         DO 50 IROW = JCOL,N
            SUM = 0
            DO 40 KCOL = 1,JM1
               SUM = SUM + A(IROW,KCOL)*A(KCOL,JCOL)
   40       CONTINUE
            A(IROW,JCOL) = A(IROW,JCOL) - SUM
   50    CONTINUE
C
C        NOW INTERCHANGE ROWS IF NEED BE, THEN TEST FOR TOO SMALL PIVOT
C
         CALL APVT(A,N,NDIM,ORDER,JCOL)
         IF ( ABS(A(JCOL,JCOL)) .LT. 1.0E-5 ) THEN
            PRINT 100
            RETURN
         END IF
C
C        NOW WE GET A ROW OF U'S
C
         JP1 = JCOL + 1
         DO 70 KCOL = JP1,N
            SUM = 0
            DO 60 IROW = 1,JM1
               SUM = SUM + A(JCOL,IROW)*A(IROW,KCOL)
   60       CONTINUE
            A(JCOL,KCOL) = ( A(JCOL,KCOL) - SUM ) / A(JCOL,JCOL)
   70    CONTINUE
   80 CONTINUE
C
C     STILL NEED TO GET LAST ELEMENT IN L MATRIX
C
      SUM = 0
      DO 90 KCOL = 1,NM1
         SUM = SUM + A(N,KCOL)*A(KCOL,N)
   90 CONTINUE
      A(N,N) = A(N,N) - SUM
      RETURN
C
  100 FORMAT(/' VERY SMALL PIVOT ELEMENT INDICATES A NEARLY',
     +        ' SINGULAR MATRIX.')
      END

Figure 2.9 Program 4.


      SUBROUTINE APVT(A,N,NDIM,ORDER,JCOL)
C
C     SUBROUTINE APVT :
C                       THIS SUBROUTINE FINDS THE LARGEST ELEMENT FOR
C     PIVOT IN JCOL OF MATRIX A, PERFORMS INTERCHANGES OF ELEMENTS IN A
C     AND ALSO INTERCHANGES THE ELEMENTS IN THE ORDER VECTOR.
C
C     A     - MATRIX OF COEFFICIENTS WHOSE ROWS ARE TO BE INTERCHANGED
C     N     - NUMBER OF EQUATIONS
C     NDIM  - FIRST DIMENSION OF A IN THE MAIN PROGRAM
C     ORDER - INTEGER VECTOR TO HOLD ROW ORDERING
C     JCOL  - COLUMN OF A BEING SEARCHED FOR PIVOT ELEMENT
C
      REAL A(NDIM,N)
      INTEGER ORDER(N),N,NDIM,JCOL
      REAL SAVE,ANEXT,BIG
      INTEGER IPVT,JP1,IROW,KCOL,ISAVE
C
C     FIND PIVOT ROW, CONSIDERING ONLY THE ELEMENTS ON AND BELOW
C     DIAGONAL
C
      IPVT = JCOL
      BIG = ABS(A(JCOL,JCOL))
      JP1 = JCOL + 1
      DO 10 IROW = JP1,N
         ANEXT = ABS(A(IROW,JCOL))
         IF ( ANEXT .GT. BIG ) THEN
            BIG = ANEXT
            IPVT = IROW
         END IF
   10 CONTINUE
C
C     NOW INTERCHANGE ROW ELEMENTS IN THE ROW WHOSE NUMBER EQUALS JCOL
C     WITH THE PIVOT ROW UNLESS PIVOT ROW IS JCOL.
C
      IF ( IPVT .EQ. JCOL ) THEN
         RETURN
      END IF
      DO 20 KCOL = 1,N
         SAVE = A(JCOL,KCOL)
         A(JCOL,KCOL) = A(IPVT,KCOL)
         A(IPVT,KCOL) = SAVE
   20 CONTINUE
C
C     NOW SWITCH ELEMENTS IN THE ORDER VECTOR
C
      ISAVE = ORDER(JCOL)
      ORDER(JCOL) = ORDER(IPVT)
      ORDER(IPVT) = ISAVE
      RETURN
      END


Figure 2.10 Program 5.


      SUBROUTINE SOLVLU(LU,B,X,N,NDIM,ORDER)
C
C     SUBROUTINE SOLVLU :
C                         THIS SUBROUTINE IS USED TO FIND THE SOLUTION
C     TO A SYSTEM OF EQUATIONS, AX = B, AFTER THE LU EQUIVALENT OF A
C     HAS BEEN FOUND. BEFORE USING THIS ROUTINE, THE VECTOR B SHOULD BE
C     SCALED IF MATRIX A WAS SCALED, USING THE SAME SCALE FACTORS.
C     WITHIN THIS ROUTINE, THE ELEMENTS OF B ARE REARRANGED IN THE SAME
C     WAY THAT THE ROWS OF A WERE INTERCHANGED, USING THE ORDER VECTOR
C     WHICH HOLDS THE ROW ORDERINGS. THE SOLUTION IS RETURNED IN X.
C
C     PARAMETERS ARE :
C
C     LU    - THE LU EQUIVALENT OF THE COEFFICIENT MATRIX
C     B     - THE VECTOR OF RIGHT HAND SIDES
C     X     - SOLUTION VECTOR
C     N     - NUMBER OF EQUATIONS
C     NDIM  - FIRST DIMENSION OF A IN THE MAIN PROGRAM
C     ORDER - INTEGER ARRAY OF ROW ORDER AS ARRANGED DURING PIVOTING
C
      REAL LU(NDIM,N),B(N),X(N)
      INTEGER ORDER(N),N,NDIM
      REAL SUM
      INTEGER I,J,JCOL,IROW,NVBL,IM1,NP1
C
C     REARRANGE THE ELEMENTS OF THE B VECTOR. X IS USED TO HOLD THEM.
C
      DO 10 I = 1,N
         J = ORDER(I)
         X(I) = B(J)
   10 CONTINUE
C
C     COMPUTE THE B' VECTOR, STORING BACK IN X
C
      X(1) = X(1) / LU(1,1)
      DO 50 IROW = 2,N
         IM1 = IROW - 1
         SUM = 0.0
         DO 40 JCOL = 1,IM1
            SUM = SUM + LU(IROW,JCOL)*X(JCOL)
   40    CONTINUE
         X(IROW) = (X(IROW) - SUM) / LU(IROW,IROW)
   50 CONTINUE
C
C     NOW GET THE SOLUTION VECTOR, X(N) = X(N) ALREADY.
C
      DO 70 IROW = 2,N
         NVBL = N - IROW + 1
         SUM = 0.0
         NP1 = NVBL + 1
         DO 60 JCOL = NP1,N
            SUM = SUM + LU(NVBL,JCOL)*X(JCOL)
   60    CONTINUE
         X(NVBL) = X(NVBL) - SUM
   70 CONTINUE
C
      RETURN
      END


SUBROUTINE SCALES (A,N,NDIM,SCFAC) 


SUBROUTINE SCALES : 

THIS SUBROUTINE SCALES THE VALUES IN AN N X N 
COEFFICIENT MATRIX SO THAT THE LARGEST ELEMENT IN EACH ROW IS 
UNITY. THE SCALED VALUES OF A ARE RETURNED IN Ti A MATRIX AND T. 
SCALE FACTORS FOR EACH ROW ARE RETURNED IN SCFAC VECTOR. USE SC. 
TO SCALE THE ELEMENTS IN THE B VECTOR ( R.H. SIDES ) BEFORE 
SOLVING THE SET OF EQUATIONS AX = B. 


A - MATRIX OF COEFFICIENTS 
N - NUMBER OF EQUATIONS 
NDIM - FIRST DIMENSION OF A IN THE CALLING PROGRAM 


SCFAC - ARRAY TO HOLD THE SCALE FACTORS 


eNANANAANAAAAAAANAAANADAA 


Figure 2.11 Program 6. 
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Figure 2.11 (continued) 


      REAL A(NDIM,NDIM),SCFAC(N)
      INTEGER N,NDIM,I,J
      REAL BIG,ANEXT
C
C     FIND THE LARGEST VALUE IN EACH ROW. IF ANY ROW HAS ONLY ZEROES,
C     PRINT MESSAGE AND RETURN.
C
      DO 20 I = 1,N
        BIG = ABS(A(I,1))
        DO 10 J = 2,N
          ANEXT = ABS(A(I,J))
          IF ( ANEXT .GT. BIG ) BIG = ANEXT
   10   CONTINUE
        IF ( BIG .EQ. 0 ) THEN
          PRINT 200, I
          RETURN
        END IF
        SCFAC(I) = 1.0 / BIG
   20 CONTINUE
C
C     NOW SCALE THE A VALUES
C
      DO 40 I = 1,N
        DO 50 J = 1,N
          A(I,J) = A(I,J)*SCFAC(I)
   50   CONTINUE
   40 CONTINUE
C
  200 FORMAT(/' ALL ELEMENTS IN ROW ',I3,' ARE ZERO.')
      RETURN
      END
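
For readers who want to experiment with row scaling outside FORTRAN, the same idea can be sketched in Python. This is an illustrative sketch only: the name `scale_rows` is hypothetical, and unlike the subroutine it returns new lists and raises an exception for an all-zero row instead of printing a message.

```python
def scale_rows(a):
    """Scale each row of a so that its largest magnitude element is 1.

    Returns (scaled_matrix, scale_factors). The same factors should be
    applied to the right-hand-side vector before solving A x = b.
    """
    scaled, factors = [], []
    for i, row in enumerate(a):
        big = max(abs(value) for value in row)
        if big == 0.0:
            # An all-zero row means the matrix is singular.
            raise ValueError(f"all elements in row {i + 1} are zero")
        factor = 1.0 / big
        factors.append(factor)
        scaled.append([value * factor for value in row])
    return scaled, factors


scaled, factors = scale_rows([[2.0, -4.0], [1.0, 0.5]])
rhs = [6.0, 3.0]
scaled_rhs = [f * r for f, r in zip(factors, rhs)]
```

Here the first row is divided by 4 and the second is left unchanged, so `scaled_rhs` is the right-hand side that matches the scaled coefficients.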


      SUBROUTINE GSITRN(A,B,X,N,NDIM,NITER,TOL)
C
C     SUBROUTINE GSITRN :
C        THIS SUBROUTINE OBTAINS THE SOLUTION TO N LINEAR
C     EQUATIONS BY GAUSS-SEIDEL ITERATION. AN INITIAL APPROXIMATION IS SENT
C     TO THE SUBROUTINE IN THE VECTOR X. THE SOLUTION, AS APPROXIMATED BY
C     THE SUBROUTINE, IS RETURNED IN X. THE ITERATIONS ARE CONTINUED UNTIL
C     THE MAXIMUM CHANGE IN ANY X COMPONENT IS LESS THAN TOL. IF THIS
C     CANNOT BE ACCOMPLISHED IN NITER ITERATIONS, A MESSAGE IS PRINTED
C     AND THE CURRENT APPROXIMATION IS RETURNED. THE COEFFICIENTS ARE
C     TO BE ARRANGED IN A SO AS TO HAVE THE LARGEST VALUES ON THE
C     DIAGONAL.
C
C     PARAMETERS ARE :
C


Figure 2.12 Program 7. 


C     A     - COEFFICIENT MATRIX WITH LARGEST VALUES ON THE DIAGONAL
C     B     - RIGHT HAND SIDE VECTOR
C     X     - INITIAL APPROXIMATION TO SOLUTION, ALSO RETURNS RESULT
C     N     - THE NUMBER OF EQUATIONS
C     NDIM  - FIRST DIMENSION OF A IN THE CALLING PROGRAM
C     NITER - LIMIT TO THE NUMBER OF ITERATIONS
C     TOL   - TEST VALUE TO STOP ITERATING THE SOLUTION
C
      REAL A(NDIM,NDIM),B(N),X(N),TOL
      INTEGER N,NDIM,NITER,I,J
C
C     CAN SAVE SOME DIVISIONS BY MAKING THE DIAGONAL ELEMENTS EQUAL TO UNITY
C
      DO 10 I = 1,N
        SAVE = A(I,I)
        B(I) = B(I) / SAVE
        DO 5 J = 1,N
          A(I,J) = A(I,J) / SAVE
    5   CONTINUE
   10 CONTINUE
C
C     NOW WE PERFORM THE ITERATIONS. STORE MAX CHANGE IN X VALUES FOR TESTING
C     AGAINST TOL. OUTER LOOP LIMITS ITERATIONS TO NITER.
C
      DO 40 ITER = 1,NITER
        XMAX = 0
        DO 30 I = 1,N
          SAVE = X(I)
          X(I) = B(I)
          DO 20 J = 1,N
            IF ( J .NE. I ) THEN
              X(I) = X(I) - A(I,J)*X(J)
            END IF
   20     CONTINUE
          IF ( ABS(X(I) - SAVE) .GT. XMAX ) THEN
            XMAX = ABS( X(I) - SAVE )
          END IF
   30   CONTINUE
        IF ( XMAX .LE. TOL ) RETURN
   40 CONTINUE
C
C     NORMAL EXIT FROM THE LOOP MEANS NON-CONVERGENT IN NITER ITERATIONS.
C     PRINT MESSAGE AND RETURN.
C
      PRINT 200, TOL,NITER
  200 FORMAT(/' DID NOT MEET TOLERANCE OF ',E14.6,' IN ',I4,
     +       ' ITERATIONS'/' LAST VALUE OF X WAS RETURNED TO CALLER')
      RETURN
      END
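
The iteration performed by GSITRN can be tried out quickly with a Python sketch such as the one below. It is an assumption-laden illustration rather than a translation: the name `gauss_seidel`, the default limits, and the choice to return the iteration count (or `None` when the tolerance is not met) are all hypothetical, and it divides by the diagonal inside the loop instead of normalizing the rows first.

```python
def gauss_seidel(a, b, x0, max_iter=50, tol=1.0e-6):
    """Gauss-Seidel iteration for A x = b, starting from x0.

    Returns (x, iterations). iterations is None if the largest change in
    any component never dropped to tol within max_iter sweeps.
    """
    n = len(b)
    x = list(x0)
    for iteration in range(1, max_iter + 1):
        max_change = 0.0
        for i in range(n):
            # Use the newest available values of the other components.
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            new_value = (b[i] - s) / a[i][i]
            max_change = max(max_change, abs(new_value - x[i]))
            x[i] = new_value
        if max_change <= tol:
            return x, iteration
    return x, None


# A small diagonally dominant system whose exact solution is (2, 1, -1).
a = [[5.0, -1.0, 0.0], [-1.0, 5.0, -1.0], [0.0, -1.0, 5.0]]
b = [9.0, 4.0, -6.0]
x, iterations = gauss_seidel(a, b, [0.0, 0.0, 0.0])
```

Diagonal dominance is what guarantees convergence here; for a matrix without it, the function may return `None` for the iteration count.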
C
      SUBROUTINE NLSYST(FCN,N,MAXIT,X,F,DELTA,XTOL,FTOL,METHOD,I)
C
C     SUBROUTINE NLSYST :
C        THIS SUBROUTINE SOLVES A SYSTEM OF N NONLINEAR
C     EQUATIONS BY NEWTON'S METHOD. THE PARTIAL DERIVATIVES OF
C     THE FUNCTIONS ARE ESTIMATED BY DIFFERENCE QUOTIENTS WHEN A
C     VARIABLE IS PERTURBED BY AN AMOUNT EQUAL TO DELTA ( DELTA IS
C     ADDED ). THIS IS DONE FOR EACH VARIABLE IN EACH FUNCTION.
C     INCREMENTS TO IMPROVE THE ESTIMATES FOR THE X-VALUES ARE
C     COMPUTED FROM A SYSTEM OF EQUATIONS USING SUBROUTINES
C     LUD AND SOLVE.
C
C     PARAMETERS ARE :
C
C     FCN    - SUBROUTINE THAT COMPUTES VALUES OF THE FUNCTIONS. MUST
C              BE DECLARED EXTERNAL IN THE CALLING PROGRAM.
C     N      - THE NUMBER OF EQUATIONS
C     MAXIT  - LIMIT TO THE NUMBER OF ITERATIONS THAT WILL BE USED
C     X      - ARRAY TO HOLD THE X VALUES. INITIALLY THIS ARRAY HOLDS
C              THE INITIAL GUESSES. IT RETURNS THE FINAL VALUES.
C     F      - AN ARRAY THAT HOLDS VALUES OF THE FUNCTIONS
C     DELTA  - A SMALL VALUE USED TO PERTURB THE X VALUES SO PARTIAL
C              DERIVATIVES CAN BE COMPUTED BY DIFFERENCE QUOTIENT.
C     XTOL   - TOLERANCE VALUE FOR CHANGE IN X VALUES TO STOP ITERA-
C              TIONS. WHEN THE LARGEST CHANGE IN ANY X MEETS XTOL,
C              THE SUBROUTINE TERMINATES.
C     FTOL   - TOLERANCE VALUE ON F TO TERMINATE. WHEN THE LARGEST F
C              VALUE IS LESS THAN FTOL, SUBROUTINE TERMINATES.
C     I      - RETURNS VALUES TO INDICATE HOW THE ROUTINE TERMINATED
C              I=1   XTOL WAS MET
C              I=2   FTOL WAS MET
C              I=-1  MAXIT EXCEEDED BUT TOLERANCES NOT MET
C              I=-2  VERY SMALL PIVOT ENCOUNTERED IN GAUSSIAN ELIMINATION
C                    STEP - NO RESULTS OBTAINED
C              I=-3  INCORRECT VALUE OF N WAS SUPPLIED - N MUST BE BETWEEN
C                    2 AND 10
C     METHOD - ALLOWS ONE TO COMPUTE THE JACOBIAN MATRIX IN SUBROUTINE
C              FCNJ BY NUMERICAL OR ANALYTIC PARTIAL DERIVATIVES.
C              0 IMPLIES NUMERICAL PARTIALS ARE TO BE COMPUTED.
C              1 IMPLIES ANALYTIC PARTIALS. THIS REQUIRES THE USER TO
C                INPUT THESE PARTIAL DERIVATIVES IN FCNJ.
C
      REAL X(N),F(N),DELTA,XTOL,FTOL
      INTEGER N,MAXIT,I
      REAL A(10,10),XSAVE(10),FSAVE(10),B(10)
      COMMON A,XSAVE,FSAVE
      INTEGER IPVT(10),IT,IVBL,ITEST,IFCN,IROW,JCOL
C
C     CHECK VALIDITY OF VALUE OF N
C
      IF ( N .LT. 2 .OR. N .GT. 10 ) THEN
        I = -3


Figure 2.13 Program 8. 


T2iL FORD (SCR, METHOD, 5,K,L,F, DEL) 


T25s, Sie 5, 2272, 3, 20) 


2 SS TERT TSE CESSSCISST SS SS SO WO T-TREE 


iz. 1.05 } T= 


23332 TEE C§UEESOTISS TD TE K WEES, A150 SHE IF EPO ES MET. 


aIssr = 8 

BD 2 Te = 1,5 

Ze) = See ee) + Bee) 

ES ( SS5(6(70E2)) -CT. EM 9 TEST = TEST + 2 
SD CORTINGE 


C
C     IF XTOL IS MET, PRINT LAST VALUES AND RETURN, ELSE DO ANOTHER
C     ITERATION.
C
        IF ( ITEST .EQ. 0 ) THEN
          I = 1
          IF ( I .EQ. 0 ) PRINT 1002, IT,X
          RETURN
        END IF
  100 CONTINUE
C
C     WHEN WE HAVE DONE MAXIT ITERATIONS, SET I = -1 AND RETURN
C
      I = -1
      RETURN
C
 1000 FORMAT(/' AFTER ITERATION NUMBER',I3,' X AND F VALUES ARE'
     +       //10F13.5)
 1001 FORMAT(/10F13.5)
 1002 FORMAT(/' AFTER ITERATION NUMBER',I3,' X VALUES (MEETING',
     +       ' XTOL) ARE '//10F13.5)
 1003 FORMAT(/' CANNOT SOLVE SYSTEM. MATRIX NEARLY SINGULAR.')
 1004 FORMAT(/' NUMBER OF EQUATIONS PASSED TO NLSYST IS INVALID.',
     +       ' MUST BE 1 < N < 11. VALUE WAS ',I3)
      END
C


C     SUBROUTINE FCN:
C     THE TWO NONLINEAR FUNCTIONS ARE DEFINED
C
      SUBROUTINE FCN (X,F)
      REAL X(*),F(*)
C
      F(1) = X(1)*X(1) + X(2)*X(2) - 5.0
      F(2) = X(2) - EXP(X(1)) - 1.0
      RETURN
      END


C     SUBROUTINE FCNJ:
C     HERE THE MATRIX OF PARTIALS IS COMPUTED:
C     METHOD: 0 - IMPLIES NUMERICAL PARTIALS
C             1 - IMPLIES ANALYTIC PARTIALS
C                 WHICH MUST BE SUPPLIED.
C
      SUBROUTINE FCNJ(FCN,METHOD,B,N,X,F,DELTA)
      INTEGER N
      REAL X(N),F(N),DELTA
      INTEGER IROW,JCOL,NP
      REAL A(10,10),FSAVE(10),XSAVE(10),B(10)
      COMMON A,XSAVE,FSAVE


      NP = N + 1
      DO 10 IROW = 1,N
        FSAVE(IROW) = F(IROW)
        XSAVE(IROW) = X(IROW)
   10 CONTINUE
C
      IF (METHOD .EQ. 0) THEN
C
C     COMPUTE NUMERICAL PARTIALS
C
C     THIS DOUBLE LOOP COMPUTES THE PARTIAL DERIVATIVES OF EACH FUNCTION
C     FOR EACH VARIABLE AND STORES THEM IN A COEFFICIENT ARRAY.
C
        DO 50 JCOL = 1,N
          X(JCOL) = XSAVE(JCOL) + DELTA
          CALL FCN(X,F)
          DO 40 IROW = 1,N
            A(IROW,JCOL) = (F(IROW) - FSAVE(IROW)) / DELTA
   40     CONTINUE
C
C     RESET X VALUES FOR NEXT COLUMN OF PARTIALS
C
          X(JCOL) = XSAVE(JCOL)
   50   CONTINUE
      ELSE
C
C     COMPUTE ANALYTIC PARTIALS
C
        A(1,1) = 2.0*X(1)
        A(2,1) = -EXP(X(1))
        A(1,2) = 2.0*X(2)
        A(2,2) = 1.0
C
      ENDIF
C
C     NOW WE PUT NEGATIVE OF F VALUES AS RIGHT HAND SIDES
C
      DO 60 IROW = 1,N
        B(IROW) = -FSAVE(IROW)
   60 CONTINUE
      RETURN
      END
C
C     SUBROUTINES LUD AND SOLVE
C     FROM PROGRAM 1 MUST BE
C     INCLUDED HERE.
C


OUTPUT FOR PROGRAM 8

 AFTER ITERATION NUMBER  1 X AND F VALUES ARE
      2.00000      4.00000
     15.00000     -4.38906

 AFTER ITERATION NUMBER  2 X AND F VALUES ARE
      1.20599      2.52201
      2.81494     -1.81804

 AFTER ITERATION NUMBER  3 X AND F VALUES ARE
       .58368      2.26151
       .45513      -.53142

 AFTER ITERATION NUMBER  4 X AND F VALUES ARE
       .27563      2.24040
       .09535      -.07696

 AFTER ITERATION NUMBER  5 X AND F VALUES ARE
       .20742      2.22751
       .00482      -.00300

 AFTER ITERATION NUMBER  6 X AND F VALUES ARE
       .20434      2.22671
       .00001      -.00001

 AFTER ITERATION NUMBER  7 X AND F VALUES ARE
       .20434      2.22671
       .00000       .00000

 THE X-VALUES ARE:
       .2043
      2.2267
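
The run above can be checked independently with a short Python sketch of Newton's method for the same two functions defined in FCN. Several things here are assumptions made for illustration: the names `fcn`, `jacobian`, and `newton_system`, the use of analytic partials, the tolerances, and the use of Cramer's rule for the 2 x 2 linear system in place of LUD and SOLVE.

```python
import math


def fcn(x):
    """The two functions from subroutine FCN."""
    return [x[0] * x[0] + x[1] * x[1] - 5.0,
            x[1] - math.exp(x[0]) - 1.0]


def jacobian(x):
    """Analytic partial derivatives; row i holds the partials of f_i."""
    return [[2.0 * x[0], 2.0 * x[1]],
            [-math.exp(x[0]), 1.0]]


def newton_system(x, xtol=1.0e-6, ftol=1.0e-6, max_it=25):
    """Newton's method for the 2 x 2 system; returns the final estimate."""
    x = list(x)
    for _ in range(max_it):
        f = fcn(x)
        if max(abs(value) for value in f) <= ftol:
            return x
        j = jacobian(x)
        det = j[0][0] * j[1][1] - j[0][1] * j[1][0]
        # Solve J * dx = -f by Cramer's rule (adequate for two equations).
        dx0 = (-f[0] * j[1][1] + j[0][1] * f[1]) / det
        dx1 = (-j[0][0] * f[1] + j[1][0] * f[0]) / det
        x[0] += dx0
        x[1] += dx1
        if max(abs(dx0), abs(dx1)) <= xtol:
            return x
    raise RuntimeError("tolerances not met within max_it iterations")


root = newton_system([2.0, 4.0])
```

Starting from (2, 4), the first correction moves the estimate to about (1.206, 2.522), and the iterates settle near (0.2043, 2.2267), in line with the printed output.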
EXERCISES
Section 2.2
1. Given the matrices A, B and vectors x, y, where
Af 1-3 2 1 > 
> A=| 4S “65 31 x=] 3) 
-1 9 7 0 = 
4 


“Al 2s: ; 
B=| 6 9-2], y=] of 


13 


ae 
nv 


a) Find A + B, 2A - 3B, 5x - 3y.
»b) Find Ax, By, x^T y.
c) Find A^T, B^T.
2. Given the matrices 


=—2 os Les fe me | 
A=| 2 0 <4 =a B= }l 3 —4l- 
1 


a) Find BA, B?, AAT. 

b) Find det (B). 

c) Find tr (B). 

d) A square matrix can always be expressed as a sum of an upper-triangular and a lower- 
triangular matrix. Find a U and an L such that U + L = B. 

e) A square matrix can also be expressed as a sum of a lower- and upper-triangular and a 
diagonal matrix. Find L + D + U=B. 


3. Let 
be — ey ae | 
A= 3 1 1}, B=|1, 3 =-S), 
2 OL 2 i =f. 


a) Show that AB = BA = I, where I is the 3 x 3 identity matrix,

\[
I = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.
\]

We shall later define such a matrix B as the inverse of A.
b) Show that AI = IA = A.
»c) Show that AC ≠ CA and also BC ≠ CB. In general, matrices do not commute under
multiplication.
d) A square matrix can also be expressed as a sum of a lower- and upper-triangular and a
diagonal matrix. Find L + D + U = A.

4. Write as a set of equations: 


1 3 1 -1]xy 3 
2 Ueeat Pay (iy ba 
0-1 4 = « Ifix,] | 6 
0 1 1 —Silx} (16, 


Section 2.3 


5. Solve by elimination:

\[
\begin{bmatrix} 5 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 5 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} =
\begin{bmatrix} 9 \\ 4 \\ -6 \end{bmatrix}.
\]

6. a) Solve by back substitution:

\[
\begin{aligned}
4x_1 - 2x_2 + x_3 &= 8,\\
-3x_2 + 5x_3 &= 3,\\
-2x_3 &= 6.
\end{aligned}
\]

4x, +


7. Solve the set of equations in Exercise 4.


8. Show that the following set of equations does not have a solution:


3x, + 2x. — 4x, — 4x, = 10,
yim ay tO ay = dy
2x, + x2 — 3x3 = 16,

— x, + 8x, -—5x,= 3,


9. If the constant vector in Exercise 8 has 2, 3, 1, 3 as components, show that the equations
have an infinite set of solutions.


Section 2.4 


10. Solve Exercise 4 by Gaussian elimination. Carry four decimals in your work.

11. Reorder the equations in Exercise 4 by putting the first equation last, and then solve using
the Gauss-Jordan method. Are your answers the same as in Exercise 10?

»12. a) Solve the system

\[
\begin{aligned}
2.51x_1 + 1.48x_2 + 4.53x_3 &= 0.05,\\
1.48x_1 + 0.93x_2 - 1.30x_3 &= 1.03,\\
2.68x_1 + 3.04x_2 - 1.48x_3 &= -0.53,
\end{aligned}
\]

by Gaussian elimination, carrying just three significant digits and chopping. Do not
interchange rows. Note that one divides by a small coefficient in reducing the third equation.

b) Solve the system again, using elimination with pivoting. Note now that the small coefficient
is no longer a divisor.

c) Substitute each set of answers in the original equations and observe that the left- and
right-hand sides match better with the (b) answers. The solution, when seven places are
carried to eliminate round-off, is

x_1 = 1.45310,   x_2 = -1.58919,   x_3 = -0.27489.

13. Solve the systems in Exercises 4, 5, and 12 by the Gauss-Jordan method.
14. By augmenting the coefficient matrix with each of the b-vectors simultaneously, solve Ax =
b with multiple right-hand sides, given that


4 & =1 3 0 0 if 7 
El ey) aie _|o ae 

Aa ieel ee abe | Sieeea |e Ps meee ile 
1 1 0 5 =o 4 = 4 


15. a) Show that for a general n x (n + 1) matrix the Gaussian elimination algorithm, steps 1-4,
would take at most n(n - 1)(2n - 1)/6 + n(n - 1) multiplications and/or divisions.
You need to know that

\[
1 + 2 + \cdots + n = n(n + 1)/2
\]

and that

\[
1^2 + 2^2 + \cdots + n^2 = n(2n + 1)(n + 1)/6.
\]

b) Also show that for the back-substitution part of this same algorithm, steps 4 and 5, the
number of multiplications and/or divisions is n(n + 1)/2.

c) Verify the statement in Section 2.4 that Gauss-Jordan takes about 50% more operations
than Gaussian elimination for the specific case of a system of three equations. Add up
the number of adds, multiplies, and divides (count a subtraction as an add).



16. Suppose we need to solve Az = b where the a_ij, z_i, b_i are complex numbers. If the programming
language being used does not allow for complex arithmetic,

»a) Show how one may solve this problem using only real arithmetic.

»b) Compare the amount of storage required using complex arithmetic versus solving
this problem in real arithmetic. (Hint: The matrix A can be written as A = B + Ci where
the matrices B, C have only real entries; similarly, we can write z = x + yi and b =
p + qi.)


Section 2.5 


17. Use LU decomposition to solve the equations in Exercise 5. 
18. Use LU decomposition to solve the set of equations in Exercise 4. 


19. Solve the system in Exercise 12(a) by the LU method. Note that the small divisor again
appears and causes round-off error to be severe. Pivoting can be used with this scheme; one
must interchange corresponding rows in both the working matrix (LU) and the original matrix
(A).
20. a) Solve the tridiagonal system 
cd | 0 Oo 0 100 
=E 2 a 0 0 200 
oe Al 0 |x =] 200 
Oe OD A) =1 200 
Cree Dig 4 100 


by finding the LU equivalent to the coefficient matrix. Note how readily the operations 
proceed for this particular case. The coefficient matrix occurs in the solution of partial- 
differential equations as discussed in Chapter 8. 

»b) Solve the equations in part (a) with a new right-hand-side vector,


b = (100, 0, 0, 0, 200)^T.
Section 2.6


21. Which of these matrices are singular?


-2 
Oe oe =) arn Ge 
A= |0 =I 4). B= 7 
6 3 2 =I 0 2 1 
: 4 2 6 
c) ul 0 =2 3 
3 1 1 4 
Cr lat iO 2 =I 
3 6 0 
22. a) Find values of x and y that make A singular:

\[
A = \begin{bmatrix} 4 & 1 & 0 \\ 2 & -1 & 2 \\ x & y & -1 \end{bmatrix}.
\]

b) Find values for x and y that make A nonsingular.


23. a) Matrix A in Exercise 21 is singular. Do its rows form linearly independent vectors? Find
the values for the a_i in Eq. (2.8).
»b) Repeat part (a) for the elements of A considered as column vectors.


24. Do these sets of equations have a solution? Find a solution if it exists, 
a) [3x — 2y 
x= By 
ack ¥ 
3x 
b fl 10 
0 2. a 
10 ty" 
a ae 
ce) [ 2 1 
= 
6 2 
df 2. 
=1 10 
6 2 


»25. The Hilbert matrix is a classical case of the pathological situation called "ill-conditioning."
The 4 x 4 Hilbert matrix is

\[
H = \begin{bmatrix}
1 & \tfrac12 & \tfrac13 & \tfrac14 \\
\tfrac12 & \tfrac13 & \tfrac14 & \tfrac15 \\
\tfrac13 & \tfrac14 & \tfrac15 & \tfrac16 \\
\tfrac14 & \tfrac15 & \tfrac16 & \tfrac17
\end{bmatrix}.
\]

For the system Hx = b, with b^T = [25/12, 77/60, 57/60, 319/420], the exact solution is
x^T = [1, 1, 1, 1].

a) Show that the matrix is ill-conditioned by showing that it is nearly singular.

b) Using only three significant digits (chopped) in your arithmetic, find the solution to
Hx = b. Explain why the answers are so poor.

c) Using only three significant digits, but rounding, again find the solution and compare it
to that obtained in part (b).

d) Using five significant digits in your arithmetic, again find the solution and compare it to
those found in parts (b) and (c).


Section 2.7 


26. Find the determinant of the matrix


0 et
3 1 -4
2 1 1


by row operations to make it (a) upper-triangular, (b) lower-triangular.

»27. Find the determinant of the matrix


w
we

noen
bee


3


28. Invert the coefficient matrix in Exercise 5, and then use the inverse to generate the solution.
Note that a symmetric matrix has a symmetric inverse.

29. If the constant vector in Exercise 5 is changed to one with components (13, 11, -22), what
is now the solution? Observe that the inverse obtained in Exercise 28 gives the answer readily.

30. Attempt to find the inverse of the coefficient matrix in Exercise 8. Note that a singular matrix
has no inverse.

31. a) Find the determinant of the Hilbert matrix in Exercise 25. A small value of the determinant
(when the matrix has elements of the order of unity) indicates ill-conditioning.

»b) Find the inverse of the Hilbert matrix in Exercise 25. The inverse of an ill-conditioned
matrix has some very large elements in comparison to the elements of the original matrix.

32. Both Gaussian elimination and the Gauss-Jordan method can be adapted to invert a matrix.
In Section 2.7, we say "it is more efficient to use the Gaussian elimination algorithm." Verify
this for the specific case of a 3 x 3 matrix by counting arithmetic operations for each method.


Section 2.8 


33. Find the Euclidean norm of these vectors and matrices.


a) x = [1.06, —2.15, 14.05, 0.0]. 
[1, 0, 0, 0}. 


14 0 0 
oF B16. 
oP. 0) 2 


f) Find the Euclidean norm of B?, of B°, of B+. 
g) Find the norms of AB; of Az.
h) Does the triangle inequality of Eq. (2.9) hold for A + B? for B + B? for x + y?


34. Find the Euclidean norm of the Hilbert matrix of Exercise 25.
»35. Find the Euclidean norm of the inverse of the Hilbert matrix of order 4 (Exercise 31).


Section 2.9* 
36. Consider the system Ax = b, where 


3.01 6.03 19g tL 
A= 11,27 4.16 1.23), b= 
0,987 —4.81 9.34 


»a) Using double precision (or a calculator with 10 or more digits of accuracy), solve for x.
b) Solve the system using three-digit (chopped) arithmetic for each arithmetic operation; call
this solution x̄.
c) Compare x and x̄ and compute e = x - x̄. What is ||e||_2?
d) Is the system ill-conditioned? What evidence is there to support your conclusion?


37. Repeat Exercise 36, except change the element a_33 to -9.34.


»38. Suppose, in Exercise 36, that uncertainties of measurement give slight changes in some of
the elements of A. Specifically, suppose a_11 is 3.00 instead of 3.01 and a_31 is 0.99 instead
of 0.987. What change does this cause in the solution vector (using precise arithmetic)?


39. Compute the residuals for the imperfect solutions in 36(b) and 37(b). Use double precision
in this computation.


40. What are the condition numbers of the coefficient matrices in Exercises 36 and 37? Use the
1-norms.


»41. Verify Eq. (2.14), using the results in Exercises 36, 39, and 40.

42. Verify Eq. (2.14), using the results in Exercises 37, 39, and 40.

43. Verify Eq. (2.15) for the results of Exercise 38.

44. Apply iterative improvement to the imperfect solution of Exercise 36.
»45. Apply iterative improvement to the imperfect solution of Exercise 37.


Section 2.10 


46. Solve Exercise 5 by the Jacobi method, beginning with the initial vector (0, 0, 0). Compare
the rate of convergence when Gauss-Seidel is used with the same starting vector.


47. Solve Exercise 5 by Gauss-Seidel iteration, beginning with approximate solution (2, 2, -1).
»48. The pair of equations
x_1 + 2x_2 = 3,
3x_1 + x_2 = 4,
can be rearranged to give x_1 = 3 - 2x_2, x_2 = 4 - 3x_1. Apply the Jacobi method to this
rearrangement, beginning with a vector very close to the solution: x^(1) = (1.01, 1.01)^T, and
observe divergence. Now apply Gauss-Seidel. Which method diverges more rapidly?


49. Solve Exercise 20(a) by Gauss—Seidel iteration. 


*In certain exercises (36, 37, 38, 44, 45), imperfect solutions will result because low-precision arithmetic is
used when the condition number is large. This exaggerates the condition number problem.
Section 2.11
»50. Beginning with (0, 0, 0), use relaxation to solve the system
iy — Set ag TT) | 
2x; + x) — 8x; = —15, 
xX, -—7x,+ x43= 10. 
51. Solve the system in Exercise 4 by relaxation.


52. Relaxation is especially well adapted to problems like Exercise 20. Solve by the relaxation 
method, starting with the vector (45, 80, 90, 80, 45) which one obtains by inspection. 


53. Solve the system of equations in Exercise 20 by Gauss-Seidel iteration with overrelaxation
(Eq. (2.13)). Vary the overrelaxation factor w to find its optimum value. (You will probably
want to write a computer program for this.)


Section 2.12 


»54. Find the two intersections nearest the origin of the two curves x^2 + x - y^2 = 1
and y - sin x^2 = 0. Use the method of iteration.


55. Solve the system


ae ee 
by iteration to obtain the solution near (2.5, 0.2, 1.6) 
56. Solve Exercise 54 by Newton's method. 

»57. Solve by using Newton's method: 
3 4 3y? = 21, 

+ 2= 0. 

Make sketches of the graphs to locate approximate values of the intersections. 

58. Apply Eq. (2.39) to compute partials and solve this system by Newton's method: 
= 1.34, 
: 0,09, 
e-—e&+z=041. 


There should be a solution near (1, 1, 1). 

59. At the end of Section 2.12, it is suggested that it would be more efficient to avoid recomputing
the partials at each step of Newton’s method for a nonlinear system, doing it only after each 
nth step when there are n equations. Redo Exercises 54 and 57 using this modification. 
Compare the rate of convergence with that when the partials are recomputed at each step. 
60. In considering the movement of space vehicles, it is frequently necessary to transform
coordinate systems. The standard inertial coordinate system has the N-axis pointed north, the
E-axis pointed east, and the D-axis pointed toward the center of the earth. A second system is
the vehicle's local coordinate system (with the i-axis straight ahead of the vehicle, the j-axis
to the right, and the k-axis downward). We can transform the vector whose local coordinates
are (i, j, k) to the inertial system by multiplying by transformation matrices:

\[
\begin{bmatrix} n \\ e \\ d \end{bmatrix} =
\begin{bmatrix} \cos a & -\sin a & 0 \\ \sin a & \cos a & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \cos b & 0 & \sin b \\ 0 & 1 & 0 \\ -\sin b & 0 & \cos b \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos c & -\sin c \\ 0 & \sin c & \cos c \end{bmatrix}
\begin{bmatrix} i \\ j \\ k \end{bmatrix}.
\]

Transform the vector (2.06, -2.44, -0.47)^T to the inertial system if a = 27°, b = 5°,
c = 72°.

61. a) Exercise 25 shows the pattern for Hilbert matrices. Find the condition number of the
9 x 9 Hilbert matrix.

b) Suppose we have a system of nine linear equations whose coefficients are the 9 x 9
Hilbert matrix. Find the right-hand side (the b-vector) if the solution vector has ones for
all components. Now increase the value of the first component of the b-vector by 1% and
find the solution to the perturbed system. Which component of the solution vector is most
changed?

62. In this statically determinate truss with pin joints, the tension F_i in each member can be
obtained from the following matrix equation (the equations result from setting the sum of all
forces acting horizontally or vertically at each pin equal to zero). (See Fig. 2.14.)

\[
\begin{bmatrix}
0.7071 & 0 & 0 & -1 & -0.8660 & 0 & 0 & 0 & 0 \\
0.7071 & 0 & 1 & 0 & 0.5 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0.7071 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & -0.7071 \\
0 & 0 & 0 & 0 & 0.8660 & 1 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & -0.5 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0.7071
\end{bmatrix} F =
\begin{bmatrix} 0 \\ -1000 \\ 0 \\ 0 \\ 500 \\ 0 \\ 0 \\ -500 \\ 0 \end{bmatrix}
\]

Figure 2.14

a) Solve the system by Gauss elimination.

b) The matrix is quite sparse, so it is a candidate for Gauss-Seidel iteration. Can it be
arranged into a diagonally dominant form? Is the system convergent for a starting vector
with all elements 0?

63. Mass spectrometry analysis gives a series of peak height readings for various ion masses.
For each peak, the height h_i is contributed to by the various constituents. These make different
contributions c_ij per unit concentration p_j so that the relation

\[
h_i = \sum_{j=1}^{n} c_{ij} p_j
\]

holds, with n being the number of components present. Carnahan (1964) gives the values
shown in Table 2.3 for c_ij.
Table 2.3


Figure 2.15 


mee Component 

number CH, CoH, CoH, CHg CHe 
1 0.165 0.202 0.317 0.234 0.182 
2 21.7 0,862 0.062 0.073 0.131 
3 22.35 13.05 4.420 6.001 
4 11.28 0 1.110 
5 9.850 1,684 
6 15.94 


If a sample had measured peak heights of h_1 = 5.20, h_2 = 61.7, h_3 = 149.2, h_4 = 79.4,
h_5 = 89.3, and h_6 = 69.3, calculate the values of p_j for each component. The total of all
the p_j values was 21.53.


64. The truss in Problem 62 is called statically determinate because nine linearly independent
equations can be established to relate the nine unknown values of the tensions in the members.
If an additional cross brace is added, as sketched in Fig. 2.15, we have ten unknowns but
still only nine equations can be written; we now have a statically indeterminate system.
Consideration of the stretching or compression of the members permits a solution, however.
We need to solve a set of equations that gives the displacements x of each pin, which is of
the form ASA^T x = P. We then get the tensions f by matrix multiplication: SA^T x = f. The
necessary matrices and vectors are

\[
A = \begin{bmatrix}
0.7071 & 0 & 0 & -1 & -0.8660 & 0 & 0 & 0 & 0 & 0 \\
0.7071 & 0 & 1 & 0 & 0.5 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & -0.8660 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & -0.5 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0.7071 & 0.5 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & -0.7071 & 0.8660 \\
0 & 0 & 0 & 0 & 0.8660 & 1 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & -0.5 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0.7071 & 0
\end{bmatrix}.
\]

S is a diagonal matrix with values (from upper left to lower right) of


4255, 6000, 6000, 3670, 3000,
3670, 6000, 6000, 4255, 3000.


(These quantities are the values of aE/L, where a is the cross-sectional area of a member,
E is the Young's modulus for the material, and L is the length.)

Solve the system of equations to determine the values of f for each of three loading
vectors:

P_1 = [0, -1000, 0, 0, 500, 0, 0, -500, 0]^T;
P_2 = [1000, 0, 0, -500, 0, 1000, 0, -500, 0]^T;
P_3 = [0, 0, 0, -500, 0, 0, 0, -500, 0]^T.


65. For turbulent flow of fluids in an interconnected network (see Fig. 2.16), the flow rate v
from one node to another is about proportional to the square root of the difference in pressures
at the nodes. (Thus fluid flow differs from flow of electrical current in a network in
that nonlinear equations result.) For the conduits in Fig. 2.16, we are to find the
pressure at each node. The values of b represent conductance factors in the relation v_ij =
b_ij (p_i - p_j)^(1/2). These equations can be set up for the pressures at each node:


At node 1; 0.3500 — p, = 0.2Vp, — pp + 0.1Vp, — p33
node 2; 0.2Vp, — py = 0.1Vp, — py + 0.2Vp — psi
node 3; 0.1Vp, — ps = 0.2Vp — py + 0.1V ps — pai
node 4: 0.1Vp) — py + 0.1Vp, — py = 0.2Vp, — 0.


Figure 2.16 (inlet label: p = 500, b = 0.3)


66. For a tridiagonal matrix of coefficients, the subroutine TRIDG is much more economical of
memory space than the subroutine LUD. It should also be much faster in execution. Analyze
each program to determine how the numbers of arithmetic operations vary with size of the
matrix. Perform tests to find how the execution times with your computer system compare
for systems of various sizes.

67. Make similar comparison between subroutine LUD and LUDCMP as described in Problem
66. Be sure to consider both sparse and dense matrices.

68. Compare the execution times to obtain the solution to some large sparse systems when using
subroutine GSITRN in comparison to LUD or LUDCMP. Vary the TOL criterion that is input
to find the effect of this parameter.


Interpolating Polynomials 


3.0 CONTENTS OF THIS CHAPTER


Chapter 3 tells you how to estimate values for a function at points intermediate
to those where its values are known. The techniques that are covered also have
application in numerical integration and differentiation and in solving differential
equations by numerical methods.


3.1 AN INTERPOLATION PROBLEM
Describes a typical situation where interpolation is necessary
3.2 INTERPOLATING POLYNOMIALS
Is the most widely used way to approximate unknown functions so that interpolation
can be done; the Lagrangian polynomial, one way to construct the polynomial, is
explained
3.3 DIVIDED DIFFERENCES
Are more efficient for constructing interpolating polynomials and allow you to
readily change the degree of the polynomial
3.4 EVENLY SPACED DATA
Allow you to use simpler procedures for constructing the polynomials
3.5 OTHER INTERPOLATING POLYNOMIALS
Gives formulas for the many different forms of interpolating polynomials; they
have the names of famous mathematicians: Newton, Gauss, Bessel, and Stirling
3.6 ERROR TERMS AND ERROR OF INTERPOLATION
Points out that these methods, like all numerical analysis procedures, are subject
to error; you are shown how to estimate the magnitude of the error and how to
minimize it
3.7 DERIVATION OF FORMULAS BY SYMBOLIC METHODS
Tells about a technique that can be applied not only to derive the formulas of this
chapter but that has more general application
3.8 INVERSE INTERPOLATION
Shows how you can use interpolating polynomials to find the value of the
independent variable that corresponds to a given value for the function, the reverse of
the previous problem
3.9 INTERPOLATING WITH A CUBIC SPLINE
Develops a newer, very important technique that avoids some of the problems
associated with ordinary interpolating polynomials
3.10 OTHER SPLINES
Tells you how to construct Bezier curves and B-splines; these are modern techniques
for constructing curves that have many important applications, such as in computer
graphics and engineering design
3.11 POLYNOMIAL APPROXIMATION OF SURFACES
Teaches how to apply polynomials and splines to the rather difficult problem of
interpolation when there are two or more independent variables
3.12 CHAPTER SUMMARY
Is the usual review section to check your understanding
3.13 COMPUTER PROGRAMS
Gives example programs that implement interpolation


3.1 AN INTERPOLATION PROBLEM


Rita Laski and Ed Baker were at lunch in the cafeteria of Ruscon Engineering. Both had
just started summer jobs and were excited at the prospect of finding out how their college
training really applied in industry. Rita was explaining her first assignment.

“They have huge amounts of data on the performance of that new rocket. Telemetry
signals are received every 10 sec, giving the position of the rocket as well as other
information. My boss has asked me to look into how we can determine the position at
intermediate times. In essence, it’s a kind of interpolation problem.”

“I see,” said Ed, “something like what we did when we studied log tables in algebra.
There we calculated the logs for intermediate values by assuming that a straight line went
between the two values from the table.”

“Exactly,” Rita replied, “except that it isn’t appropriate to assume that the positions
are linear with time. Typically, the points look something like this when they are plotted.”
She drew on a paper napkin to illustrate her point. “As you can see, sometimes a signal
is missed, like on the third point.”

[Sketch: position plotted against time, with points spaced 10 sec apart]

“You can see that a straight line is just impossible if we want to fit the data. And
we can’t just draw curves like I’ve done here because we want to get the intermediate
values in the computer, and a computer can’t look at a graph to find the values. What
really bothers me is that Mr. Johnson, my boss, told me that we also must be able to get
the intermediate values even when there is a maximum or minimum to the curve. Besides
all this, we need a method of good efficiency because there are so many cases to handle.”

In this chapter we will explore efficient techniques to interpolate, particularly for 
those situations where the data are far from linear. The principle that will be used is to 
fit a polynomial curve to the points. The problem is one of interest to many applications. 
Much of the development comes from work done by Newton and Kepler as they analyzed 
data on the positions of stars and planets. 

In Chapters 1 and 2, we examined the question: “Given an explicit function of the 
independent variable x, what is the value of x corresponding to a certain value of the 
function?” In Chapter 1, x was a simple variable. In Chapter 2, x was a vector. We now 
want to consider a question somewhat the reverse of this: “Given values of an unknown 
function corresponding to certain values of x, what is the behavior of the function?” We 
would like to answer the question “What is the function?” but this is always impossible 
to determine with a limited amount of data. 

Our purpose in determining the behavior of the function, as evidenced by the sample 
of data pairs [x, f(x)], is severalfold. We will wish to approximate other values of the 
function at values of x not tabulated (interpolation or extrapolation) and to estimate the 
integral of f(x) and its derivative. The latter objectives will lead us into ways of solving 
ordinary- and partial-differential equations. 

The strategy we will use in approximating unknown values of the function is straightforward. 
We will find a polynomial that fits a selected set of points (x_i, f(x_i)) and assume 
that the polynomial and the function behave nearly the same over the interval in question. 
Values of the polynomial then should be reasonable estimates of the values of the unknown 
function. When the polynomial is of the first degree, this leads to the familiar linear 
interpolation. We will be interested in polynomials of degree higher than the first, so we 
can approximate functions that are far from linear, or so we can get good values from a 
table with wider spacing. 

Interpolating in tables of data was once a very important topic. Today, with computers 
and calculators that recalculate values for most functions at electronic speeds, we look 
up values from published tables less frequently. However, if computer memory becomes 
inexpensive enough, it may be preferable to interpolate from tabulated values in computers 
rather than compute. For instance, on most hand calculators the trigonometric functions 
are evaluated with stored values through a method called CORDIC. Similarly, the logarithmic 
functions, like ln(x), are evaluated using stored values rather than a Taylor series. 
In both these cases the trade-off is to use cheap memory (ROM) to save computing time. 
Finally, a major reason for discussing this subject is to lay the groundwork for numerical 
integration and differentiation. 


3.2 LAGRANGIAN POLYNOMIALS 


If we desire to find a polynomial that passes through the same points as our unknown 
function, we could set up a system of equations involving the coefficients of the poly- 
nomial. For example, suppose we want to fit a cubic to these data: 


    x      f(x) 
    3.2    22.0 
    2.7    17.8 
    1.0    14.2 
    4.8    38.3 
    5.6    51.7 


First, we need to select four points to determine our polynomial. (The maximum degree 
of the polynomial is always one less than the number of points.) Suppose we choose the 
first four points. If the cubic is ax^3 + bx^2 + cx + d, we can write four equations involving 
the unknown coefficients a, b, c, and d: 


    when x = 3.2:  a(3.2)^3 + b(3.2)^2 + c(3.2) + d = 22.0, 
    if x = 2.7:    a(2.7)^3 + b(2.7)^2 + c(2.7) + d = 17.8, 
    if x = 1.0:    a(1.0)^3 + b(1.0)^2 + c(1.0) + d = 14.2, 
    if x = 4.8:    a(4.8)^3 + b(4.8)^2 + c(4.8) + d = 38.3. 


Solving these equations by the methods of the previous chapter gives us the poly- 
nomial. We can then estimate the values of the function at some value of x, say x = 3.0, 
by substituting 3.0 for x in the polynomial. 

For this example, the set of equations gives 


    a = -0.5275 
    b = 6.4952 
    c = -16.1177 
    d = 24.3499 


and our polynomial is 

    -0.5275x^3 + 6.4952x^2 - 16.1177x + 24.3499. 


At x = 3.0, the estimated value is 20.21. 
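
The direct approach above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the helper name solve_linear_system is hypothetical, and the elimination routine is a plain Gaussian elimination with partial pivoting rather than any particular method from the previous chapter.

```python
# Fit a*x^3 + b*x^2 + c*x + d through four points by solving the
# 4x4 linear system for the coefficients (illustrative sketch only).

def solve_linear_system(matrix, rhs):
    """Solve matrix * unknowns = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy [A | b] so the inputs are not modified.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Bring the largest remaining entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        total = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - total) / aug[row][row]
    return solution

xs = [3.2, 2.7, 1.0, 4.8]
fs = [22.0, 17.8, 14.2, 38.3]
rows = [[x**3, x**2, x, 1.0] for x in xs]   # one equation per data point
a, b, c, d = solve_linear_system(rows, fs)
estimate = ((a * 3.0 + b) * 3.0 + c) * 3.0 + d   # nested evaluation at x = 3.0
print(a, b, c, d, estimate)
```

The coefficients and the estimate at x = 3.0 should agree with the values quoted above to the number of figures shown.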

We seek a better and simpler way of finding such interpolating polynomials. The 
above procedure is awkward, especially if we want a new polynomial that is also made 
to fit at the point (5.6, 51.7), or if we want to see what difference it would make to use 
a quadratic instead of a cubic. Furthermore, this technique leads to an ill-conditioned 
system of equations.* 


*For this example, the system of four equations has a condition number of approximately 555. Adding the last 
point to solve for the coefficients of a quartic polynomial causes the condition number to jump to near 100,000! 

We will first look at another very straightforward approach, the Lagrangian polynomial. 
The Lagrangian polynomial is perhaps the simplest way to exhibit the existence 
of a polynomial for interpolation with unevenly spaced data. Data where the x-values are 
not equispaced often occur as the result of experimental observations or when historical 
data are examined. 

Suppose we have a table of data of x- and f(x)-values: 

    x      f(x) 
    x_0    f_0 
    x_1    f_1 
    x_2    f_2 
    x_3    f_3 


Here we do not assume uniform spacing between the x-values, nor do we need the 
x-values arranged in a particular order. The x-values must all be distinct, however. 
Through these four data pairs we can pass a cubic. The Lagrangian form for this is 


    P_3(x) = \frac{(x - x_1)(x - x_2)(x - x_3)}{(x_0 - x_1)(x_0 - x_2)(x_0 - x_3)} f_0 
           + \frac{(x - x_0)(x - x_2)(x - x_3)}{(x_1 - x_0)(x_1 - x_2)(x_1 - x_3)} f_1 
           + \frac{(x - x_0)(x - x_1)(x - x_3)}{(x_2 - x_0)(x_2 - x_1)(x_2 - x_3)} f_2 
           + \frac{(x - x_0)(x - x_1)(x - x_2)}{(x_3 - x_0)(x_3 - x_1)(x_3 - x_2)} f_3. 


Note that it is made up of four terms, each of which is a cubic in x; hence the sum is a 
cubic. The pattern of each term is to form the numerator as a product of linear factors 
of the form (x - x_i), omitting one x_i in each term, the omitted value being used to form 
the denominator by replacing x in each of the numerator factors. In each term, we multiply 
by the f_i corresponding to the x_i omitted in the numerator factors. The Lagrangian polynomial 
for other degrees of interpolating polynomials employs this same pattern of forming 
a sum of polynomials all of the desired degree. 

It is easy to see that the Lagrangian polynomial does in fact pass through each of 
the points used in its construction. For example, in the above equation for P_3(x), let 
x = x_2. All terms but the third vanish because of a zero numerator, while the third 
term becomes just (1)f_2. Hence P_3(x_2) = f_2. Similarly, P_3(x_i) = f_i for i = 0, 1, 3. 


EXAMPLE Fit a cubic through the first four points of the preceding table and use it to find the 
interpolated value for x = 3.0. 


    P_3(3.0) = \frac{(3.0 - 2.7)(3.0 - 1.0)(3.0 - 4.8)}{(3.2 - 2.7)(3.2 - 1.0)(3.2 - 4.8)} (22.0) 
             + \frac{(3.0 - 3.2)(3.0 - 1.0)(3.0 - 4.8)}{(2.7 - 3.2)(2.7 - 1.0)(2.7 - 4.8)} (17.8) 
             + \frac{(3.0 - 3.2)(3.0 - 2.7)(3.0 - 4.8)}{(1.0 - 3.2)(1.0 - 2.7)(1.0 - 4.8)} (14.2) 
             + \frac{(3.0 - 3.2)(3.0 - 2.7)(3.0 - 1.0)}{(4.8 - 3.2)(4.8 - 2.7)(4.8 - 1.0)} (38.3). 


Carrying out the arithmetic, P_3(3.0) = 20.21. 

Observe that we get the same result as before. The arithmetic in this method is 
tedious, although hand calculators are convenient for this type of computation. A computer 
program at the end of this chapter implements the method. 
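
As a compact illustration of the pattern described above (and not the program referred to at the end of the chapter), the Lagrangian form might be coded as follows. The function name lagrange_interpolate is our own choice for this sketch.

```python
def lagrange_interpolate(xs, fs, x):
    """Evaluate the Lagrangian polynomial through (xs[i], fs[i]) at x.

    The x-values must be distinct but need not be ordered or evenly spaced.
    """
    total = 0.0
    for i, (xi, fi) in enumerate(zip(xs, fs)):
        term = fi
        # Numerator omits (x - xi); the denominator replaces x by xi.
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

xs = [3.2, 2.7, 1.0, 4.8]
fs = [22.0, 17.8, 14.2, 38.3]
value = lagrange_interpolate(xs, fs, 3.0)
print(round(value, 2))
```

Evaluating at any of the tabulated x-values returns the corresponding f-value, since every term but one has a zero factor in its numerator.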

An interpolating polynomial, while passing through the points used in its construction, 
does not, in general, give exactly correct values when used for interpolation. The reason 
for this is that the underlying relationship is often not a polynomial of the same degree. 
We are therefore interested in the error of interpolation. 

We will develop an expression for the error of P_n(x), an nth-degree interpolating 
polynomial. We write the error function in a form that has the known property that it is 
zero at the n + 1 points, from x_0 through x_n, where P_n(x) and f(x) are the same. We call 
this function E(x): 


    E(x) = f(x) - P_n(x) = (x - x_0)(x - x_1) \cdots (x - x_n) g(x). 


The n + 1 linear factors give E(x) the zeros we know it must have, and g(x) accounts 
for its behavior at values other than x_0, x_1, ..., x_n. Obviously, f(x) - P_n(x) - 
E(x) = 0, so 


    f(x) - P_n(x) - (x - x_0)(x - x_1) \cdots (x - x_n) g(x) = 0.    (3.1) 


In order to determine g(x), we now use the interesting mathematical device of constructing 
an auxiliary function (the reason for its special form becomes apparent as the 
development proceeds). We call this auxiliary function W(t), and define it as 


    W(t) = f(t) - P_n(t) - (t - x_0)(t - x_1) \cdots (t - x_n) g(x). 


Note in particular that x has not been replaced by t in the g(x) portion. (W is really a 
function of both t and x, but we are only interested in variations of t.) We now examine 
the zeros of W(t). 

Certainly at t = x_0, x_1, ..., x_n, the W function is zero (n + 1 times), but it is also 
zero if t = x by virtue of Eq. (3.1). There are then a total of n + 2 values of t that make 
W(t) = 0. We now impose the necessary requirements on W(t) for the law of mean value 
to hold. W(t) must be continuous and differentiable. If this is so, there is a zero to its 
derivative W'(t) between each of the n + 2 zeros of W(t), a total of n + 1 zeros. 
If W''(t) exists, and we suppose it does, there will be n zeros of W''(t), and likewise 
n - 1 zeros of W'''(t), and so on, until we reach W^{(n+1)}(t), which must have at least 
one zero in the interval that has x_0, x_n, or x as endpoints. Call this value t = \xi. We 
then have 


    W^{(n+1)}(\xi) = 0 = \frac{d^{n+1}}{dt^{n+1}} [f(t) - P_n(t) - (t - x_0) \cdots (t - x_n) g(x)]_{t=\xi} 
                   = f^{(n+1)}(\xi) - 0 - (n + 1)! g(x).    (3.2) 


The right-hand side of Eq. (3.2) occurs because of the following arguments. The 
(n + 1)st derivative of f(t), evaluated at t = \xi, is obvious. The (n + 1)st derivative of 
P_n(t) is zero because every time any polynomial is differentiated its degree is reduced by 
one, so that the nth derivative is of degree zero (a constant) and its (n + 1)st derivative 
is zero. We apply the same argument to the (n + 1)st-degree polynomial in t that occurs 
in the last term: its (n + 1)st derivative is a constant, and this constant results from the 
t^{n+1} term and is (n + 1)!. Of course g(x) is independent of t and goes through the 
differentiations unchanged. The form of g(x) is now apparent: 


    g(x) = \frac{f^{(n+1)}(\xi)}{(n + 1)!}. 

The conditions on W(t) that are required for this development (continuous and 
differentiable n + 1 times) will be met if f(x) has these same properties, because P_n(x) is 
continuous and differentiable. We now have our error term: 


    E(x) = (x - x_0)(x - x_1) \cdots (x - x_n) \frac{f^{(n+1)}(\xi)}{(n + 1)!},    (3.3) 

with \xi on the smallest interval that contains {x, x_0, x_1, ..., x_n}. 


The expression for error given in Eq. (3.3) is interesting but is not always extremely 
useful. This is because the actual function that generates the x_i, f_i values is often unknown; 
we obviously then do not know its (n + 1)st derivative. We can conclude, however, that 
if the function is “smooth,” a low-degree polynomial should work satisfactorily. (A smooth 
function has small higher derivatives.) On the other hand, a “rough” function can be 
expected to have larger errors when interpolated. We can also conclude that extrapolation 
(applying the interpolating polynomial outside the range of x-values employed to construct 
it) will have larger errors than for interpolation. 
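
When the function is known, Eq. (3.3) can be checked numerically. The sketch below is our own illustration, not an example from the text: it assumes f(x) = e^x tabulated at x = 0, 0.5, 1, for which the third derivative is again e^x and so lies between 1 and e on [0, 1]. The error of the quadratic at x = 0.25 should therefore fall between the two bounds computed from Eq. (3.3).

```python
import math

def lagrange_interpolate(xs, fs, x):
    """Evaluate the interpolating polynomial through (xs[i], fs[i]) at x."""
    total = 0.0
    for i, (xi, fi) in enumerate(zip(xs, fs)):
        term = fi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Assumed test function for this sketch: f(x) = exp(x), so f'''(x) = exp(x).
xs = [0.0, 0.5, 1.0]
fs = [math.exp(x) for x in xs]
x = 0.25

actual_error = math.exp(x) - lagrange_interpolate(xs, fs, x)

# Product of the n + 1 linear factors in Eq. (3.3), here with n = 2.
product = 1.0
for xi in xs:
    product *= (x - xi)

# f'''(xi) lies between exp(0) and exp(1) for xi in [0, 1].
lower_bound = product * math.exp(0.0) / math.factorial(3)
upper_bound = product * math.exp(1.0) / math.factorial(3)
print(lower_bound, actual_error, upper_bound)
```

Because the product of factors is positive at x = 0.25, the bounds appear in increasing order; at a point where the product is negative they would be reversed.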


3.3 DIVIDED DIFFERENCES 


There are two problems when the Lagrangian polynomial method is used for interpolation. 
First, there are more arithmetic operations than for the divided-difference method we now 
discuss. More importantly, if we desire to add or subtract a point from the set used to 
construct the polynomial, we essentially have to start over in the computations. The 
divided-difference method permits one to reuse the previous computations. 

Our treatment of divided-difference tables assumes that a function f(x) is known at 
several distinct values for x: 


    x_0    f_0 
    x_1    f_1 
    x_2    f_2 
    x_3    f_3 


We do not assume that the x’s are equispaced nor even that the values are arranged in 
any particular order. 
Consider the nth-degree polynomial: 


    P_n(x) = a_0 + (x - x_0)a_1 + (x - x_0)(x - x_1)a_2 + \cdots 
           + (x - x_0)(x - x_1) \cdots (x - x_{n-1})a_n.    (3.4) 


If we choose the a_i so that P_n(x) equals f(x) at the n + 1 known points, x_0, x_1, ..., x_n, 
then P_n(x) is an interpolating polynomial. We will show that the a_i are readily determined 
by using what are called the divided differences of the tabulated values. 

A special notation is used for divided differences: 


    f[x_0, x_1] = \frac{f_1 - f_0}{x_1 - x_0} 


is called the first divided difference between x_0 and x_1. 


    f[x_1, x_2] = \frac{f_2 - f_1}{x_2 - x_1} 

is the first divided difference between x_1 and x_2. 
In general, 

    f[x_s, x_t] = \frac{f_t - f_s}{x_t - x_s} 


is the first divided difference between x_s and x_t. (Observe that the ordering of the points 
is immaterial: 

    f[x_s, x_t] = \frac{f_t - f_s}{x_t - x_s} = \frac{f_s - f_t}{x_s - x_t} = f[x_t, x_s].) 


Second- and higher-order divided differences are defined in terms of lower-order 
differences. For example: 


    f[x_0, x_1, x_2] = \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0}, 


    f[x_0, x_1, ..., x_i] = \frac{f[x_1, x_2, ..., x_i] - f[x_0, x_1, ..., x_{i-1}]}{x_i - x_0}. 


The concept is even extended to a zero-order difference: 

    f[x_s] = f_s. 


Using this notation, a divided difference table, in symbolic form, is Table 3.1(a). 


Table 3.1(a) 

    x_i    f_i    f[x_i, x_{i+1}]    f[x_i, x_{i+1}, x_{i+2}]    f[x_i, ..., x_{i+3}] 
    x_0    f_0 
                  f[x_0, x_1] 
    x_1    f_1                       f[x_0, x_1, x_2] 
                  f[x_1, x_2]                                    f[x_0, x_1, x_2, x_3] 
    x_2    f_2                       f[x_1, x_2, x_3] 
                  f[x_2, x_3]                                    f[x_1, x_2, x_3, x_4] 
    x_3    f_3                       f[x_2, x_3, x_4] 
                  f[x_3, x_4] 
    x_4    f_4 


A table with specific numerical values might be Table 3.1(b) (with the differences rounded 
to three decimal places). These are the same data as for the table in Section 3.2. 


Table 3.1(b) 

    x_i    f_i     f[x_i, x_{i+1}]    f[x_i, ..., x_{i+2}]    f[x_i, ..., x_{i+3}]    f[x_i, ..., x_{i+4}] 
    3.2    22.0 
                   8.400 
    2.7    17.8                       2.856 
                   2.118                                      -0.528 
    1.0    14.2                       2.012                                           0.256 
                   6.342                                      0.0865 
    4.8    38.3                       2.263 
                   16.750 
    5.6    51.7 
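
The recursive definition makes the table easy to generate by machine. The following is a minimal sketch (the function name divided_difference_table is ours) in which table[k][i] holds the kth-order divided difference f[x_i, ..., x_{i+k}].

```python
def divided_difference_table(xs, fs):
    """Return columns of divided differences; table[k][i] = f[x_i, ..., x_{i+k}]."""
    n = len(xs)
    table = [list(fs)]                      # zero-order differences f[x_i] = f_i
    for order in range(1, n):
        prev = table[-1]
        column = [
            (prev[i + 1] - prev[i]) / (xs[i + order] - xs[i])
            for i in range(n - order)
        ]
        table.append(column)
    return table

xs = [3.2, 2.7, 1.0, 4.8, 5.6]
fs = [22.0, 17.8, 14.2, 38.3, 51.7]
table = divided_difference_table(xs, fs)
for order, column in enumerate(table):
    print(order, [round(entry, 4) for entry in column])
```

Because the computation here is not rounded at each stage, the last digit of an entry may differ slightly from a table built from rounded intermediate values.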


We are now ready to establish that the a_i of Eq. (3.4) are given by these divided 
differences. We write Eq. (3.4) with x = x_0, x = x_1, ..., x = x_n, giving 


    with x = x_0:  P_n(x_0) = a_0, 
         x = x_1:  P_n(x_1) = a_0 + (x_1 - x_0)a_1, 
         x = x_2:  P_n(x_2) = a_0 + (x_2 - x_0)a_1 + (x_2 - x_0)(x_2 - x_1)a_2, 
         ... 
         x = x_n:  P_n(x_n) = a_0 + (x_n - x_0)a_1 + (x_n - x_0)(x_n - x_1)a_2 + \cdots 
                            + (x_n - x_0)(x_n - x_1) \cdots (x_n - x_{n-1})a_n. 


If P_n(x) is to be an interpolating polynomial, it must match the table for all n + 1 entries: 

    P_n(x_i) = f_i for i = 0, 1, 2, ..., n. 


If the P_n(x_i) in each equation is replaced by f_i, we get a triangular system, and each a_i 
can be computed in turn. 
From the first equation, 


    a_0 = f_0 = f[x_0] makes P_n(x_0) = f_0. 


If a_1 = f[x_0, x_1], then 


    P_n(x_1) = f_0 + (x_1 - x_0) \frac{f_1 - f_0}{x_1 - x_0} = f_1. 

If a_2 = f[x_0, x_1, x_2], then 

    P_n(x_2) = f_0 + (x_2 - x_0) \frac{f_1 - f_0}{x_1 - x_0} 
             + (x_2 - x_0)(x_2 - x_1) \frac{(f_2 - f_1)/(x_2 - x_1) - (f_1 - f_0)/(x_1 - x_0)}{x_2 - x_0} 
             = f_2. 

One can show in similar fashion that each P_n(x_i) will equal f_i if a_i = f[x_0, x_1, ..., x_i]. 


EXAMPLE Write the interpolating polynomial of degree 3 that fits the above table at all points from 
x_0 = 3.2 to x_3 = 4.8. 


    P_3(x) = 22.0 + 8.400(x - 3.2) + 2.856(x - 3.2)(x - 2.7) 
           - 0.528(x - 3.2)(x - 2.7)(x - 1.0). 


What is the fourth-degree polynomial that fits at all five points? We only have to add 
one more term to P_3(x): 


    P_4(x) = P_3(x) + 0.256(x - 3.2)(x - 2.7)(x - 1.0)(x - 4.8). 


When this method is used for interpolation, one should observe that nested multiplication 
can be used to cut down on the number of arithmetic operations, for example, for 
x = 3: 


    P_3(3) = {[-0.528(3 - 1.0) + 2.856](3 - 2.7) + 8.400}(3 - 3.2) + 22.0. 
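
The two ideas in this example, reusing earlier coefficients when a point is added and evaluating by nested multiplication, might be sketched as follows. The function names are our own, and the coefficient routine overwrites a single list in place rather than storing the full table.

```python
def newton_coefficients(xs, fs):
    """Return a_i = f[x_0, ..., x_i], the top diagonal of the divided-difference table."""
    coeffs = list(fs)
    n = len(xs)
    for order in range(1, n):
        # Work from the bottom up so lower-order entries are still available.
        for i in range(n - 1, order - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - order])
    return coeffs

def newton_evaluate(xs, coeffs, x):
    """Evaluate Eq. (3.4) at x by nested multiplication."""
    result = coeffs[-1]
    for k in range(len(coeffs) - 2, -1, -1):
        result = result * (x - xs[k]) + coeffs[k]
    return result

xs = [3.2, 2.7, 1.0, 4.8, 5.6]
fs = [22.0, 17.8, 14.2, 38.3, 51.7]

coeffs = newton_coefficients(xs, fs)
p3 = newton_evaluate(xs[:4], coeffs[:4], 3.0)   # cubic through the first four points
p4 = newton_evaluate(xs, coeffs, 3.0)           # quartic: one more coefficient, same leading four
print(round(p3, 2), p4)
```

Note that the cubic uses simply the first four entries of the coefficient list computed for all five points; nothing has to be recomputed when the fifth point is added.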


If we compute the interpolated value at x = 3.0 for each of the third-degree polynomials 
in Sections 3.2 and 3.3, we get the same result, P_3(3.0) = 20.21. This is not 
surprising, because all third-degree polynomials that pass through the same four points 
are identical. They may be of different form, but they can all be reduced to the same 
form. 

This seems intuitively true, since there are n + 1 constants in the polynomial, and 
the n + 1 data pairs are exactly enough to determine them. More formally we can reason 
thus, the proof being by contradiction: 

Suppose there are two different polynomials of degree n that are alike at the n + 1 
points. Call these P_n(x) and Q_n(x) and write their difference: 


    D(x) = P_n(x) - Q_n(x), 


where D(x) is a polynomial of at most degree n. But since P and Q match at the n + 1 
pairs of points, their difference D(x) is equal to zero for all n + 1 of these x-values; that 
is, it is a polynomial of degree n at most but has n + 1 distinct zeros. This is impossible 
unless D(x) is identically zero. Hence P_n(x) and Q_n(x) are not different; they must be 
the same polynomial. 

One important consequence of this uniqueness property of interpolating polynomials 
is that their error terms are also identical (though we may want to express them in different 
forms). We then already know the error term for an interpolating polynomial derived 
from a table of divided differences. It is precisely that expression we derived in Section 
3.2, Eq. (3.3). 


3.4 EVENLY SPACED DATA 


The problem of interpolation from tabulated data is considerably simplified if the values 
of the function are given at evenly spaced intervals of the independent variable. It is 
necessary here to arrange the data in a table with x-values in ascending order. In addition 
to columns for x and f(x), we will tabulate differences of the functional values. Table 
3.2 is a typical difference table. 


Table 3.2 A difference table 


    x      f(x)     \Delta f(x)    \Delta^2 f(x)    \Delta^3 f(x)    \Delta^4 f(x) 
    0.0    0.000 
                    0.203 
    0.2    0.203                   0.017 
                    0.220                           0.024 
    0.4    0.423                   0.041                             0.020 
                    0.261                           0.044 
    0.6    0.684                   0.085                             0.052 
                    0.346                           0.096 
    0.8    1.030                   0.181                             0.211 
                    0.527                           0.307 
    1.0    1.557                   0.488 
                    1.015 
    1.2    2.572 


Each of the columns to the right of the f(x) column is computed by calculating the 
difference between two values in the column to its left. 

Symbols that represent the entries in a difference table will be helpful in using the 
table of differences to determine coefficients in interpolating polynomials. It is conventional 
to let the letter h stand for the uniform difference in the x-values, h = \Delta x. Using 
subscripts to represent the order of the x and f(x) values, we define the first differences 
of the function as 


    \Delta f_0 = f_1 - f_0,  \Delta f_1 = f_2 - f_1,  \Delta f_2 = f_3 - f_2,  ...,  \Delta f_i = f_{i+1} - f_i. 

The second- and higher-order differences are similarly defined:* 


    \Delta^2 f_1 = \Delta(\Delta f_1) = \Delta(f_2 - f_1) = \Delta f_2 - \Delta f_1 = (f_3 - f_2) - (f_2 - f_1) 
                 = f_3 - 2f_2 + f_1, 
    \Delta^2 f_i = f_{i+2} - 2f_{i+1} + f_i,    (3.5) 
    \Delta^3 f_1 = \Delta(\Delta^2 f_1) = f_4 - 3f_3 + 3f_2 - f_1, 
    \Delta^3 f_i = f_{i+3} - 3f_{i+2} + 3f_{i+1} - f_i, 
    ... 
    \Delta^n f_i = f_{i+n} - n f_{i+n-1} + \frac{n(n - 1)}{2!} f_{i+n-2} - \frac{n(n - 1)(n - 2)}{3!} f_{i+n-3} + \cdots 


In Eqs. (3.5) and throughout this chapter, a subscript on f indicates the x-value at 
which it is evaluated: namely, f_i = f(x_i). The pattern of the coefficients in Eqs. (3.5) is 
the familiar array of coefficients in the binomial expansion. This fact we can prove most 
readily by symbolic methods, which we postpone until a later section. The second- and 
higher-order differences are generally obtained by differencing the previous differences, 
but Eq. (3.5) shows how any difference can be calculated directly from the functional 
values. 


*The differences that we define here are called forward differences. Some texts define differences that are called 
backward differences (written as \nabla f_i, defined as \nabla f_i = f_i - f_{i-1}) and central differences (written \delta f_i). Our 
treatment uses only the forward difference. 

Table 3.3 illustrates the formation of a difference table using symbolic representation 
of the entries. While it might be natural always to call the first x in the table x9, we will 
frequently wish to refer to only a portion of the data pairs of the table, so the “first” x 
loses its significance. We then arbitrarily choose the origin for the subscripted variable, 
because by the use of negative values we can refer to x-entries that precede x_0. We will 
set the variable s equal to the subscript on x; it then serves to index the x-values. 

In hand computation of a difference table, great care should be exercised to avoid 
arithmetic errors in the subtractions—the fact that we subtract the upper entry from the 
lower adds a real source of confusion. One of the best ways to check for mistakes is to 
add the sum of the numbers in each column to the top entry in the column to its left. 
This sum should equal the bottom entry in the column to the left. 

Since each entry in the difference table is the difference of a pair of numbers in the 
column to its left, one could recompute one of this pair if it should be erased. As a 
consequence of this, the entire table could be reproduced, given only one value in each 
column, if the table is extended to the highest possible order of differences. 
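
A difference table and the column-sum check described above can be sketched as follows, using the function values of Table 3.2. The function names are ours, and the tolerance in the check is an arbitrary choice to allow for floating-point rounding.

```python
def forward_difference_table(values):
    """table[k][i] holds the kth forward difference beginning at entry i."""
    table = [list(values)]
    while len(table[-1]) > 1:
        prev = table[-1]
        # Subtract the upper entry from the lower one.
        table.append([prev[i + 1] - prev[i] for i in range(len(prev) - 1)])
    return table

def column_sums_check(table, tol=1e-9):
    """Top entry of a column plus the sum of the column to its right
    should equal the bottom entry of that column."""
    for left, right in zip(table, table[1:]):
        if abs(left[0] + sum(right) - left[-1]) > tol:
            return False
    return True

fs = [0.000, 0.203, 0.423, 0.684, 1.030, 1.557, 2.572]   # f(x) column of Table 3.2
table = forward_difference_table(fs)
for order, column in enumerate(table[:5]):
    print(order, [round(entry, 3) for entry in column])
print(column_sums_check(table))
```

If one entry is deliberately altered, the check fails for the column pair that involves it, which is what makes it useful for catching slips in hand computation.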

When f(x) behaves like a polynomial for the set of data given, the difference table 
has special properties. In Table 3.4 a function is tabulated over the domain x = 1 to 
x = 6, and f(x) obviously behaves the same as x^3. (Note carefully that this does not 
imply that f(x) = x^3; the value of f(x) at x = 7 might well be 17 instead of 7^3 = 343. 
We only know the values of f(x) as given in the table.) 


Table 3.3 Difference table, using symbols 


    s     x         f(x)      \Delta f         \Delta^2 f         \Delta^3 f         \Delta^4 f 
    -2    x_{-2}    f_{-2} 
                              \Delta f_{-2} 
    -1    x_{-1}    f_{-1}                     \Delta^2 f_{-2} 
                              \Delta f_{-1}                       \Delta^3 f_{-2} 
    0     x_0       f_0                        \Delta^2 f_{-1}                       \Delta^4 f_{-2} 
                              \Delta f_0                          \Delta^3 f_{-1} 
    1     x_1       f_1                        \Delta^2 f_0                          \Delta^4 f_{-1} 
                              \Delta f_1                          \Delta^3 f_0 
    2     x_2       f_2                        \Delta^2 f_1                          \Delta^4 f_0 
                              \Delta f_2                          \Delta^3 f_1 
    3     x_3       f_3                        \Delta^2 f_2 
                              \Delta f_3 
    4     x_4       f_4 

Table 3.4 

    x    f(x)    \Delta f    \Delta^2 f    \Delta^3 f    \Delta^4 f 
    1    1 
                 7 
    2    8                   12 
                 19                        6 
    3    27                  18                          0 
                 37                        6 
    4    64                  24                          0 
                 61                        6 
    5    125                 30 
                 91 
    6    216 


We observe that the third differences are constant. Consequently, the fourth and all 
higher differences will be zero. The fact that the nth-order differences of any nth-degree 
polynomial are constant is readily shown. To prove that these nth-order differences are 
constant, we first examine the differences of ax^n: 


    \Delta(ax^n) = a(x + h)^n - ax^n 
                 = ax^n + anx^{n-1}h + \cdots + ah^n - ax^n 
                 = (anh)x^{n-1} + terms of lower degree in x, 
    \Delta(anh x^{n-1}) = an(n - 1)h^2 x^{n-2} + terms of lower degree. 


Noting that every time a difference is taken the leading term has a power of x one 
less than originally and that the difference of a constant term will be zero, we have for 
a polynomial 

    \Delta P_n(x) = \Delta(a_1 x^n + a_2 x^{n-1} + \cdots + a_n x + a_{n+1}) 
                  = a_1 nh x^{n-1} + terms of lower degree, 

    \Delta^2 P_n(x) = a_1 n(n - 1)h^2 x^{n-2} + terms of lower degree, 

    ... 

    \Delta^n P_n(x) = a_1 n(n - 1)(n - 2) \cdots (1) h^n x^{n-n} = a_1 n! h^n.    (3.6) 


This shows not only that the nth difference is a constant, but that its value is a_1 n! h^n. For 
P_3(x) = x^3, 


    \Delta^3 P_3(x) = (1)(3!)(1)^3 = 6 when h = 1. 


This is exactly what was found in Table 3.4. Note a similarity between the nth difference 
and the nth derivative of P_n(x): 


    P_n^{(n)}(x) = a_1 n!. 
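
Equation (3.6) is easy to confirm numerically. The short sketch below first reproduces the third and fourth differences of Table 3.4, and then tries a second case of our own choosing (leading coefficient 2, spacing h = 0.5), for which Eq. (3.6) predicts a constant third difference of 2(3!)(0.5)^3 = 1.5.

```python
import math

def nth_differences(values, order):
    """Apply the forward-difference operator `order` times to a list of values."""
    diffs = list(values)
    for _ in range(order):
        diffs = [diffs[i + 1] - diffs[i] for i in range(len(diffs) - 1)]
    return diffs

# Table 3.4: f(x) = x^3 at x = 1, 2, ..., 6 (h = 1).
cubes = [x**3 for x in range(1, 7)]
third = nth_differences(cubes, 3)     # [6, 6, 6]
fourth = nth_differences(cubes, 4)    # [0, 0]

# Assumed second case: 2x^3 tabulated with h = 0.5, starting at x = 0.
lead, degree, h = 2.0, 3, 0.5
samples = [lead * (k * h) ** degree for k in range(6)]
predicted = lead * math.factorial(degree) * h ** degree   # a_1 n! h^n
observed = nth_differences(samples, degree)
print(third, fourth, predicted, observed)
```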


Note also that there are similarities between these differences in an evenly spaced 
table and the divided differences we computed in Section 3.3. In fact the first divided 
differences computed for Table 3.1(a) when multiplied by \Delta x are exactly the same as 
the column of \Delta f in Table 3.4. Similarly, the higher-order divided differences are related 
to the higher-order differences by simple multipliers. In an exercise, you are asked to 
show this relationship. 

When the function that is tabulated behaves like a polynomial (and this we can tell 
by observing that its nth-order differences are constant or nearly so), we can approximate 
it by the polynomial it resembles. Our problem is to find the simplest means of writing 
the nth-degree polynomial that passes through n + 1 pairs of points, (x_i, f_i), i = 0, 
1, 2, ..., n. Note that such a polynomial is unique: there is only one polynomial of 
degree n passing through n + 1 points, as we have previously observed. 

Perhaps the easiest way to write a polynomial that passes through a group of equi- 
spaced points is the Newton—Gregory forward polynomial: 


    P_n(x_s) = f_0 + s \Delta f_0 + \frac{s(s - 1)}{2!} \Delta^2 f_0 + \frac{s(s - 1)(s - 2)}{3!} \Delta^3 f_0 + \cdots 

             = f_0 + \binom{s}{1} \Delta f_0 + \binom{s}{2} \Delta^2 f_0 + \binom{s}{3} \Delta^3 f_0 + \binom{s}{4} \Delta^4 f_0 + \cdots.    (3.7) 


In this equation we have used the notation \binom{s}{n}, the number of combinations of s 
things taken n at a time, which is the same as the factorial ratios also shown. Referring 
to Table 3.3, we now observe that P_n(x) does match the table at all the data pairs (x_i, f_i), 
i = 0, 1, 2, ..., n. When s = 0, P_n(x_0) = f_0. If s = 1, 


    P_n(x_1) = f_0 + \Delta f_0 = f_0 + f_1 - f_0 = f_1. 

If s = 2, 

    P_n(x_2) = f_0 + 2\Delta f_0 + \Delta^2 f_0 = f_2. 


Similarly we can demonstrate that P_n(x) formed according to Eq. (3.7) matches at all 
n + 1 points.* 

In Eq. (3.7), note that differences all fall on a descending diagonal line beginning 
at f_0 in Table 3.3. 

We have previously observed that if, over the domain from x_0 to x_n, P_n(x) and f(x) 
have the same values at tabulated values of x, it is reasonable to assume that they will 
be nearly the same at intermediate points. This assumption is the basis for the use of 
P_n(x) as an interpolating polynomial. We again emphasize that f(x) and P_n(x) will, in 
general, not be the same function. Hence there is some error to be expected in the estimate 
from such interpolation. We use the polynomial in Eq. (3.7) as an interpolating polynomial 
by letting s take on nonintegral values. This extends the definition of s so that, for any 
value of x, 


    s = \frac{x - x_0}{h}. 


EXAMPLE Write a Newton-Gregory forward polynomial of degree 3 that fits Table 3.2 for the four 
points at x = 0.4 to x = 1.0. Use it to interpolate for f(0.73). 

To make the polynomial fit as specified, we must index the x's so that x_0 = 0.4. 
It follows then that f_0 = 0.423, \Delta f_0 = 0.261, \Delta^2 f_0 = 0.085, and \Delta^3 f_0 = 0.096. We compute 
s: 


    s = \frac{x - x_0}{h} = \frac{0.73 - 0.4}{0.2} = 1.65. 


*This demonstration is not a proof, of course. The section on symbolic methods gives perhaps the neatest proof. 

Applying these to Eq. (3.7) with terms through \Delta^3 f_0 to give a cubic, we have 


    f(0.73) = 0.423 + (1.65)(0.261) + \frac{(1.65)(0.65)}{2}(0.085) 
            + \frac{(1.65)(0.65)(-0.35)}{6}(0.096) 
            = 0.423 + 0.4306 + 0.0456 - 0.0060 
            = 0.893.    (3.8) 


The function tabulated in Table 3.2 is tan x, for which the true value is 0.895 at 
x = 0.73. We hence see that there is an error in the third decimal place. We should 
expect some error: since the third differences are far from constant, our cubic 
polynomial is not a perfect representation of the function. Even so, our interpolating 
polynomial gave a fair estimate, certainly better than linear interpolation, which gives 
0.911. 
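
The computation in this example might be mechanized as in the sketch below. The function names are our own; the routine forms the differences on the descending diagonal as it goes, so it needs only the function values beginning at the chosen x_0.

```python
def binomial(s, k):
    """Generalized binomial coefficient s(s-1)...(s-k+1)/k! for real s."""
    result = 1.0
    for j in range(k):
        result *= (s - j) / (j + 1)
    return result

def newton_gregory_forward(x0, h, fs, degree, x):
    """Evaluate Eq. (3.7) through the term of the given degree.

    fs holds f_0, f_1, ... starting at x0; at least degree + 1 values are needed.
    """
    s = (x - x0) / h
    diffs = list(fs)
    total = 0.0
    for k in range(degree + 1):
        total += binomial(s, k) * diffs[0]   # diffs[0] is the kth difference of f_0
        diffs = [diffs[i + 1] - diffs[i] for i in range(len(diffs) - 1)]
    return total

# Function values of Table 3.2 from x = 0.4 through x = 1.2.
fs = [0.423, 0.684, 1.030, 1.557, 2.572]
cubic = newton_gregory_forward(0.4, 0.2, fs, 3, 0.73)
quartic = newton_gregory_forward(0.4, 0.2, fs, 4, 0.73)
print(round(cubic, 3), round(quartic, 3))
```

Raising the degree by one only adds a term, so the same routine also gives the fourth-degree estimate that uses the next difference on the diagonal.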

Even though the fourth differences are also not constant, we would hope for some 
improvement if we approximated f(x) by a fourth-degree polynomial. We can just add 
one more term to Eq. (3.8) to do this: 


    \binom{s}{4} \Delta^4 f_0 = \frac{(1.65)(0.65)(-0.35)(-1.35)}{24}(0.211) = 0.0044, 


    f(0.73) = 0.893 + 0.0044 = 0.898. 


Normally the higher-degree polynomial is better, but in this instance, adding another term 
does not improve our estimate; in fact it worsens it slightly. This is due to round-off 
errors in the original values in this instance. 

The domain over which an interpolating polynomial agrees with the function is most 
readily found by working backward from the last difference that is included in Eq. (3.7), 
drawing imaginary diagonals to the left between the entries. The x-values included between 
this fan of diagonals form the domain of the interpolating polynomial. It will always be 
found to include one more x-entry than the degree of the polynomial. 


3.5 OTHER INTERPOLATING POLYNOMIALS 


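It is sometimes convenient to write the interpolating polynomial in other forms. The 
Newton-Gregory backward polynomial is 

    P_n(x_s) = f_0 + \binom{s}{1} \Delta f_{-1} + \binom{s+1}{2} \Delta^2 f_{-2} + \binom{s+2}{3} \Delta^3 f_{-3} 
             + \binom{s+3}{4} \Delta^4 f_{-4} + \cdots.    (3.9) 


Table 3.5 Difference table, using symbols. (This is similar to Table 3.4.) 


    s     x         f(x)      \Delta f         \Delta^2 f         \Delta^3 f         \Delta^4 f 
    -4    x_{-4}    f_{-4} 
                              \Delta f_{-4} 
    -3    x_{-3}    f_{-3}                     \Delta^2 f_{-4} 
                              \Delta f_{-3}                       \Delta^3 f_{-4} 
    -2    x_{-2}    f_{-2}                     \Delta^2 f_{-3}                       \Delta^4 f_{-4} 
                              \Delta f_{-2}                       \Delta^3 f_{-3} 
    -1    x_{-1}    f_{-1}                     \Delta^2 f_{-2}                       \Delta^4 f_{-3} 
                              \Delta f_{-1}                       \Delta^3 f_{-2} 
    0     x_0       f_0                        \Delta^2 f_{-1}                       \Delta^4 f_{-2} 
                              \Delta f_0                          \Delta^3 f_{-1} 
    1     x_1       f_1                        \Delta^2 f_0                          \Delta^4 f_{-1} 
                              \Delta f_1                          \Delta^3 f_0 
    2     x_2       f_2                        \Delta^2 f_1                          \Delta^4 f_0 
                              \Delta f_2                          \Delta^3 f_1 
    3     x_3       f_3                        \Delta^2 f_2 
                              \Delta f_3 
    4     x_4       f_4 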


In Table 3.5, it is seen that the differences used here form a diagonal row going upward 
and to the right,* in contrast to the downward sloping diagonal row of differences used 
in the Newton-Gregory forward formula. Trial with various negative integer values of 
s demonstrates that Eq. (3.9) also matches with data pairs in the table from x = x_0 to 
x = x_{-n}. 

If the subscripts are suitably chosen, the points where P,(x) matches the table will 
be the same as for the Newton—Gregory forward formula, however. When this is done, 
the two polynomials are really identical though of a different form. We illustrate by 
reworking the same problem as in Section 3.4. 

Choosing x_0 = 1.0, so that 

    s = \frac{x - x_0}{h} = \frac{0.73 - 1.0}{0.2} = -1.35, 

gives the cubic that fits Table 3.2 between x = 0.4 and x = 1.0; hence 


    f(0.73) = 1.557 + (-1.35)(0.527) + \frac{(-0.35)(-1.35)}{2}(0.181) 
            + \frac{(0.65)(-0.35)(-1.35)}{6}(0.096) 
            = 1.557 - 0.7114 + 0.0428 + 0.0049 
            = 0.893. 


One observes that the identical result is obtained as before. 
If we again add one more term to make our interpolation correspond to a fourth-degree 
polynomial, we have 


    f(0.73) = 0.893 + \frac{(1.65)(0.65)(-0.35)(-1.35)}{24}(0.052) = 0.894. 


*This ascending diagonal row of differences starting at f_0 is equal to the backward differences of f_0; those in 
the downward sloping row are forward differences. 




In this instance we do improve the estimate, coming closer to the true value of 0.895. 
Why did this fourth-degree polynomial not match the fourth-degree one of Section 3.4? 
In the present case, the domain is from x = 0.2 to x = 1.0, as is found by going back 
diagonally from the last difference, 0.052, in contrast to the domain of x = 0.4 to x = 
1.2. The two fourth-degree polynomials are not identical. 

There is much nonsense in many books about the application of the Newton—Gregory 
forward polynomial only at the beginning of the table, and the backward formula only 
at the end. As our examples clearly show, we may use either formula anywhere in the 
table by suitably subscripting the x’s. Furthermore, the identical results are given by any 
interpolating polynomial that ends on the same difference entry. The reason that books 
tell us to use the forward polynomial at the beginning of a table is so we can increase 
the degree merely by adding a term. The backward polynomial can similarly be increased 
in degree very easily when applied to points near the end of the table. 

There is a rich variety of interpolation formulas beyond the two we have so far 
discussed. They differ in the paths taken through the difference table. For example, the 
Gauss forward goes through the table in a zigzag path, the first step being a forward one. 
Stirling’s and Bessel’s formulas proceed horizontally, using averages of differences, one 
starting with f_0 and the other one starting halfway between f_0 and f_1. 


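Formulas for interpolation of equispaced data 


Newton-Gregory Forward (fits at x_0 to x_n): 

    P_n(x) = f_0 + \binom{s}{1} \Delta f_0 + \binom{s}{2} \Delta^2 f_0 + \binom{s}{3} \Delta^3 f_0 + \binom{s}{4} \Delta^4 f_0 
           + \cdots + \binom{s}{n} \Delta^n f_0. 

Newton-Gregory Backward (fits at x_{-n} to x_0): 

    P_n(x) = f_0 + \binom{s}{1} \Delta f_{-1} + \binom{s+1}{2} \Delta^2 f_{-2} + \binom{s+2}{3} \Delta^3 f_{-3} + \binom{s+3}{4} \Delta^4 f_{-4} 
           + \cdots + \binom{s+n-1}{n} \Delta^n f_{-n}. 

Gauss Forward (fits at x_{-[n/2]} to x_{[(n+1)/2]}): 

    P_n(x) = f_0 + \binom{s}{1} \Delta f_0 + \binom{s}{2} \Delta^2 f_{-1} + \binom{s+1}{3} \Delta^3 f_{-1} + \binom{s+1}{4} \Delta^4 f_{-2} 
           + \cdots + \binom{s+[(n-1)/2]}{n} \Delta^n f_{-[n/2]}. 

Gauss Backward (fits at x_{-[(n+1)/2]} to x_{[n/2]}): 

    P_n(x) = f_0 + \binom{s}{1} \Delta f_{-1} + \binom{s+1}{2} \Delta^2 f_{-1} + \binom{s+1}{3} \Delta^3 f_{-2} + \binom{s+2}{4} \Delta^4 f_{-2} 
           + \cdots + \binom{s+[n/2]}{n} \Delta^n f_{-[(n+1)/2]}. 

Stirling (fits at x_{-[n/2]} to x_{[n/2]} if n is even): 

    P_n(x) = f_0 + \binom{s}{1} \frac{\Delta f_{-1} + \Delta f_0}{2} 
           + \frac{\binom{s+1}{2} + \binom{s}{2}}{2} \Delta^2 f_{-1} 
           + \binom{s+1}{3} \frac{\Delta^3 f_{-2} + \Delta^3 f_{-1}}{2} 
           + \frac{\binom{s+2}{4} + \binom{s+1}{4}}{2} \Delta^4 f_{-2} + \cdots 

    with last term 

        \frac{\binom{s+[n/2]}{n} + \binom{s+[n/2]-1}{n}}{2} \Delta^n f_{-[n/2]}          if n is even; 
        \binom{s+[n/2]}{n} \frac{\Delta^n f_{-[(n+1)/2]} + \Delta^n f_{-[n/2]}}{2}      if n is odd. 

Bessel (fits at x_{-[n/2]} to x_{[(n+1)/2]} if n is odd): 

    P_n(x) = \frac{f_0 + f_1}{2} + \frac{\binom{s}{1} + \binom{s-1}{1}}{2} \Delta f_0 
           + \binom{s}{2} \frac{\Delta^2 f_{-1} + \Delta^2 f_0}{2} 
           + \frac{\binom{s+1}{3} + \binom{s}{3}}{2} \Delta^3 f_{-1} 
           + \binom{s+1}{4} \frac{\Delta^4 f_{-2} + \Delta^4 f_{-1}}{2} + \cdots 

    with last term 

        \binom{s+[n/2]-1}{n} \frac{\Delta^n f_{-[n/2]} + \Delta^n f_{-(n-2)/2}}{2}      if n is even; 
        \frac{\binom{s+[n/2]}{n} + \binom{s+[n/2]-1}{n}}{2} \Delta^n f_{-[n/2]}          if n is odd. 

In the above, 

    \binom{s}{n} = \frac{s(s - 1)(s - 2) \cdots (s - n + 1)}{n!}; 

    [m/2] is the greatest integer in m/2. 

To use the formulas, choose x_0, compute s = (x - x_0)/h, and substitute values. 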


We illustrate the use of these formulas by writing several interpolating polynomials 
for the values in Table 3.6. 


Table 3.6 

    x      f          \Delta f    \Delta^2 f    \Delta^3 f    \Delta^4 f 
    0.2    1.06894 
                      0.11242 
    0.5    1.18136                0.01183 
                      0.12425                   0.00123 
    0.8    1.30561                0.01306                     0.00015 
                      0.13731                   0.00138 
    1.1    1.44292                0.01444                     0.00014 
                      0.15175                   0.00152 
    1.4    1.59467                0.01596 
                      0.16771 
    1.7    1.76238 


Newton-Gregory forward fitting at x = 0.5 through x = 1.4 (x_0 = 0.5): 


    P_3(x) = 1.18136 + s(0.12425) + \binom{s}{2}(0.01306) + \binom{s}{3}(0.00138). 


Newton-Gregory backward fitting at x = 1.1 through x = 0.2 (x_0 = 1.1): 


    P_3(x) = 1.44292 + s(0.13731) + \binom{s+1}{2}(0.01306) + \binom{s+2}{3}(0.00123). 


We could, however, also define x_0 = 0.2 and write: 


    P_3(x) = 1.44292 + \binom{s-3}{1}(0.13731) + \binom{s-2}{2}(0.01306) + \binom{s-1}{3}(0.00123). 


These last two are really identical because the values of s are computed with different 
values of x_0 so that the coefficients turn out to be the same. For example, at x = 0.35, 
s = -2.5 for the first, but s = 0.5 for the second. 


Gauss forward fitting at x = 0.5 through x = 1.1 (x_0 = 0.8): 


    P_2(x) = 1.30561 + s(0.13731) + \binom{s}{2}(0.01306). 


Gauss backward fitting at x = 0.8 through x = 1.7 (x_0 = 1.4): 


    P_3(x) = 1.59467 + s(0.15175) + \binom{s+1}{2}(0.01596) + \binom{s+1}{3}(0.00152). 


Stirling fitting at x = 0.2 through x = 1.4 (x_0 = 0.8): 


    P_4(x) = 1.30561 + \binom{s}{1} \frac{0.12425 + 0.13731}{2} 
           + \frac{\binom{s+1}{2} + \binom{s}{2}}{2}(0.01306) 
           + \binom{s+1}{3} \frac{0.00123 + 0.00138}{2} 
           + \frac{\binom{s+2}{4} + \binom{s+1}{4}}{2}(0.00015). 


Bessel fitting at x = 0.8 through x = 1.1 (x_0 = 0.8): 


    P_1(x) = \frac{1.30561 + 1.44292}{2} + \frac{\binom{s}{1} + \binom{s-1}{1}}{2}(0.13731). 


(This is a pretty elaborate way to write a linear polynomial! It reduces to 1.30561 + 
s(0.13731), as given directly by the Newton-Gregory forward formula.) 

We conclude this section with the observation that, since all polynomials through 
the same points are identical, we can use the familiar Newton—Gregory forward poly- 
nomial whenever the points to fit have been preselected from an equispaced table. There 
is no need to memorize any formula besides this one. 


ERROR TERMS AND ERROR OF INTERPOLATION 


Since the interpolating polynomial is not, in general, identical with the unknown function 
f(x) even though they match at certain points, predicting the values of the function at 
nontabulated points is subject to error. 

For instance, as the number of equispaced points increases, the interpolating poly- 
nomials for a given function may actually become less reliable over certain subintervals 
of the domain. An example that illustrates this is the function 


$$f(x) = \frac{1}{1 + 25x^2} \quad \text{on } [-1, 1].$$


As the number of equispaced points $n = 1, 3, 5, 7, \ldots$ increases, so does the max error
$E_n$, defined as

$$E_n = \max\,|f(x) - p_{n-1}(x)| \quad \text{for } x \text{ in } [-1, 1].$$


Here $E_n$ is defined for the n (odd) equispaced points, and $p_{n-1}$ is the (n - 1)th-degree
interpolating polynomial defined by the n points. Values were computed by using
Program 1 at the end of the chapter (see Fig. 3.11); 11, 13, 15, 17, 19 equispaced points
on [-1, 1] were used. The estimate of the max error for each n is


    n     E_n (approximately)

   11       1.9132554
   13       3.4710393
   15       7.0075584
   17      14.3699041
   19      29.0384012


From these estimates it is clear that the $E_n$'s are actually increasing.


We leave it as an exercise for you to explain why this magnification of error occurs. 
You will find it instructive to compare the graph of f(x) with the graphs of some of these 
higher-degree polynomials. 
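
As a rough way to reproduce this experiment, the following Python sketch builds the interpolating polynomial in Newton divided-difference form and samples the error on a fine grid. It is not the book's Program 1; the function names, the 2001-point sampling grid, and the use of divided differences are assumptions made for illustration, so the printed maxima will agree with the table above only approximately.

```python
def divided_differences(xs, ys):
    """Return the Newton divided-difference coefficients f[x0], f[x0,x1], ..."""
    coef = list(ys)
    n = len(xs)
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    return coef

def newton_eval(xs, coef, x):
    """Evaluate the Newton-form polynomial by nested multiplication."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[k]) + coef[k]
    return result

def runge(x):
    return 1.0 / (1.0 + 25.0 * x * x)

def max_error(n, samples=2001):
    """Estimate E_n for n equispaced points on [-1, 1] by sampling."""
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    coef = divided_differences(xs, [runge(x) for x in xs])
    grid = [-1.0 + 2.0 * k / (samples - 1) for k in range(samples)]
    return max(abs(runge(x) - newton_eval(xs, coef, x)) for x in grid)

errors = {n: max_error(n) for n in (11, 13, 15, 17, 19)}
for n, e in errors.items():
    print(n, e)
```

The estimated maxima grow with n, and the largest deviations occur near the ends of the interval, which is a useful hint for the exercise.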

In Section 3.2 we derived the expression for the error of an interpolating polynomial
and have earlier observed that all polynomials of degree n that pass through the
same n + 1 points are identical. We can therefore just copy Eq. (3.3) to give our error
expression:

$$E(x) = (x - x_0)(x - x_1)\cdots(x - x_n)\,\frac{f^{(n+1)}(\xi)}{(n+1)!}, \qquad \xi \text{ in the interval } (x_0, x_n).$$

We wish to modify this, expressing it in terms of $s = (x - x_0)/h$, to make it more
compatible with our interpolating polynomials. Remembering that

$$x_1 = x_0 + h, \quad x_2 = x_0 + 2h, \quad \ldots,$$

so that

$$(x - x_0) = sh, \quad (x - x_1) = sh - h = (s - 1)h, \quad (x - x_2) = sh - 2h = (s - 2)h,$$

we find that Eq. (3.8) becomes

$$E(x) = \frac{s(s-1)(s-2)\cdots(s-n)}{(n+1)!}\,h^{n+1} f^{(n+1)}(\xi)
      = \binom{s}{n+1} h^{n+1} f^{(n+1)}(\xi), \qquad \xi \text{ between } (x_0, x_n, x). \qquad (3.10)$$


Referring again to Eq. (3.7), we observe that the next term after the last one included
in an nth-degree Newton-Gregory forward interpolating polynomial is $\binom{s}{n+1}\Delta^{n+1} f_0$. One
can get the error term of Eq. (3.10) by substituting $h^{n+1} f^{(n+1)}(\xi)$ for the (n + 1)st
difference. This is true for all interpolating polynomials, not just the Newton-Gregory
forward one.


EXAMPLE  The data below are for sin x. Interpolate to estimate sin(0.8), using a quadratic through
the first three points; also estimate the error using Eq. (3.10).


  x      f(x)        Δf          Δ²f         Δ³f
 0.1   0.09983
                   0.37960
 0.5   0.47943                -0.07570
                   0.30390                -0.04797
 0.9   0.78333                -0.12367
                   0.18023                -0.02846
 1.3   0.96356                -0.15213
                   0.02810
 1.7   0.99166


We take $x_0 = 0.1$, and s = (0.8 - 0.1)/0.4 = 1.75. Then

$$f(0.8) = 0.09983 + 1.75(0.37960) + \frac{(1.75)(0.75)}{2}(-0.07570) = 0.71445,$$

$$\text{Error} = \frac{(1.75)(0.75)(-0.25)}{6}(0.4)^3(-\cos \xi).$$


We don't know what value to use for $\xi$. However, $\xi$ lies in the interval bounded by
$x_0 = 0.1$ and $x_2 = 0.9$. Because the cosine is monotonic in this interval, we can easily
find maximum and minimum values for $\cos \xi$. (They occur at the endpoints.) This should
bound the error:

$$\text{Error} = \frac{(1.75)(0.75)(-0.25)}{6}(0.4)^3[-\cos(0.1)] = 3.48 \times 10^{-3};$$

$$\text{Error} = \frac{(1.75)(0.75)(-0.25)}{6}(0.4)^3[-\cos(0.9)] = 2.18 \times 10^{-3}.$$


The actual error is $2.90 \times 10^{-3}$, falling between the two calculated values, as
expected. Note carefully that the error estimate we have made is for the truncation error
only (the error that arises because we do not use a polynomial of infinite degree). The
round-off error, which is negligible in this example, acts independently.
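
The arithmetic of this example is easy to reproduce. The Python sketch below is an illustration only; the variable names are chosen here and are not from the text. It evaluates the quadratic at x = 0.8 from the tabulated differences, then the two bounds from Eq. (3.10), using the fact that the third derivative of sin x is -cos x.

```python
import math

x0, h = 0.1, 0.4
f0, d1, d2 = 0.09983, 0.37960, -0.07570   # f_0 and its first and second differences
x = 0.8
s = (x - x0) / h

# Quadratic Newton-Gregory forward estimate of sin(0.8)
estimate = f0 + s * d1 + s * (s - 1) / 2 * d2

# Error term of Eq. (3.10) with n = 2: C(s,3) * h^3 * f'''(xi), f'''(x) = -cos(x)
coeff = s * (s - 1) * (s - 2) / 6 * h**3
bound_at_left = coeff * (-math.cos(0.1))
bound_at_right = coeff * (-math.cos(0.9))

actual = math.sin(x) - estimate
print(estimate, bound_at_left, bound_at_right, actual)
```

The actual error should fall between the two bounds, as the text states.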

We now can determine which of several interpolating polynomials, all of the same
degree, give the best estimates of f(x). It makes no difference which form we use; any
convenient one is satisfactory. But we will choose the domain where it fits the table in
such a way as to minimize the error term. Changing the points where $P_n(x)$ and f(x) match
will cause two changes in the error term: the coefficient involving s will vary, and the
value of $f^{(n+1)}(\xi)$ may vary because the interval in which $\xi$ lies changes. Since the
function f(x) is, in general, unknown, we certainly do not know its derivatives, and
further, the value of $\xi$ is not known except for the intervals that contain it. We then
choose the polynomial for which the coefficient involving s has the smallest value. This
occurs if the value of x at which the polynomial is to be evaluated (the point we are
interpolating) lies nearest the midpoint of the interval from $x_0$ to $x_n$. Note that this implies
that extrapolation will normally be less accurate than interpolation, in accordance with
our intuition.

In the preceding example where f(x) is known, we reduce the error if we use the 
central three points to fit the polynomial. 

The computations, with $x_0 = 0.5$, s = 0.75, give an estimate of f(0.8) = 0.71895.
The error estimates are $-0.7 \times 10^{-3}$ and $-2.19 \times 10^{-3}$, which bracket the actual error
of $-1.59 \times 10^{-3}$. We see that properly choosing the domain where the polynomial fits
can reduce the error by nearly one-half.

Choosing x9 = 0.9, so that the last three points are the domain, will be a bad choice, 
with the greatest error of all. 

What if, as is usual, we do not know f(x)? In that situation, we cannot find bounds
on $f^{(n+1)}(\xi)$. We will later see (Chapter 4) that the nth derivative and the nth differences
are related. Anticipating some results from that chapter, we have

$$f^{(n)}(x) \approx \frac{\Delta^n f}{h^n}.$$


In the absence of knowledge of the function, one may use this relation in the error-term
computation, provided that the (n + 1)st differences do not vary too greatly. This
accounts for the rough rule of thumb that the error of interpolation is about the magnitude
of the next term beyond that included in the formula, for with this approximation to
$h^{n+1} f^{(n+1)}(\xi)$, the error term is this next term of the polynomial.

In the above examples, this rule gives the following results:


For the quadratic that fits at $x_0 = 0.1$ to $x_2 = 0.9$, the next term gives

$$\text{Error} = \frac{(1.75)(0.75)(-0.25)}{6}(-0.04797) = 2.62 \times 10^{-3}.$$


For the quadratic that fits at $x_0 = 0.5$ to $x_2 = 1.3$, we get

$$\text{Error} = \frac{(0.75)(-0.25)(-1.25)}{6}(-0.02846) = -1.11 \times 10^{-3}.$$


While these are not the same as the exact errors, they are certainly close enough to give 
a good working value. We recommend this technique when the derivative needed in Eq. 
(3.3) or Eq. (3.10) is unknown. 
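
The next-term rule needs nothing but the difference table, which the following Python sketch makes explicit. The helper `binom` and the variable names are illustrative assumptions; the third differences are those tabulated for sin x above.

```python
def binom(s, k):
    """Generalized binomial coefficient s(s-1)...(s-k+1)/k! for real s."""
    result = 1.0
    for j in range(k):
        result *= (s - j) / (j + 1)
    return result

h, x = 0.4, 0.8

# Quadratic fitted at x0 = 0.1 .. 0.9: next term uses the third difference at x0 = 0.1
s_a = (x - 0.1) / h
next_term_a = binom(s_a, 3) * (-0.04797)

# Quadratic fitted at x0 = 0.5 .. 1.3: next term uses the third difference at x0 = 0.5
s_b = (x - 0.5) / h
next_term_b = binom(s_b, 3) * (-0.02846)

print(next_term_a, next_term_b)
```

These are the two estimates quoted above; no derivative of the function is needed to obtain them.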

To keep the error of our polynomial interpolation as small as possible, we have seen 
that we must choose the points to fit our polynomial so that the x-value at which we do 
the interpolation is as well centered as possible. The Gauss forward and backward 
polynomials make it simple to keep the x-value centered in the range of fit, even though 
the degree is increased. Stirling's and Bessel’s formulas act similarly. 

For even-degree polynomials, Newton and Gauss polynomials are poorer in this
respect than are Stirling's. For odd degrees, Bessel's formulas are preferred.

In addition to keeping the range symmetrical about the value for interpolation, a 
second most important decision must be made—the degree of the polynomial. The 
truncation error will decrease with the degree, but the round-off error that perturbs the 
differences in the table will increase. This argues for an intermediate degree as being 
most accurate. 

Another problem occurs with high-degree interpolation in certain cases. A local 
irregularity in the tabulated function, due either to a local “bump” in the function or to 
a relatively large experimental error at one point, can cause amplified distortions in the 
interpolating polynomial at points remote from the point of disturbance. Such amplification 
increases with the degree; our only recourse is to keep the degree small, or else we will 
have to approximate the function by a different technique. This phenomenon of amplified
distortions and alternative techniques for fitting functions to data are discussed later in this
chapter.

To summarize, then, the error of polynomial interpolation is reduced by making its
range as symmetrical as possible about the point of interpolation, and by choosing a 
higher degree of polynomial, up to the point where round-off or the effect of local 
irregularities cause offsetting errors. But another factor has the greatest importance of 
all—the step size h. With a given set of tabulated data, we may not be able to do much 
about the step size. However, if we are designing a new set of tables, the step size may 
be open to our selection; it is then advantageous to make it small. (This makes our tables 
more bulky and adds proportionately to the computational effort, of course.) 

This effect of h is so important that a special notation is often used to focus attention
on it. We write

$$\text{Error} = O(h^n)$$

if there is a constant K such that, when h is “small enough,”

$$|\text{Error}| \le K h^n,$$

where h > 0 and where K is some constant not equal to zero. The expression $O(h^n)$ is
read “order of h to the nth power.” For example, the error of a quadratic interpolating
polynomial whose range is $(x_0, x_2)$ would be of order h cubed because

$$\text{Error} = \frac{s(s-1)(s-2)}{6}\,h^3 f'''(\xi), \quad \text{or} \quad \text{Error} = O(h^3).$$


As h gets small, $f'''(\xi) \to f'''(x_0)$ because $\xi$ is squeezed between $x_0$ and $x_2$, and hence
approaches a constant value.
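
One way to see the order of the error in practice is to halve h and watch the error shrink. The Python sketch below is a small experiment under assumed settings (f = sin x, $x_0 = 0.5$, evaluation at s = 0.5, and step sizes 0.1, 0.05, 0.025, all chosen here for illustration). For an $O(h^3)$ error, each halving of h should divide the error by roughly 8.

```python
import math

def quadratic_error(h, x0=0.5, s=0.5):
    """Error of the quadratic through x0, x0+h, x0+2h for f = sin, at x0 + s*h."""
    f0, f1, f2 = (math.sin(x0 + k * h) for k in range(3))
    d1 = f1 - f0                # first forward difference
    d2 = f2 - 2.0 * f1 + f0     # second forward difference
    p = f0 + s * d1 + s * (s - 1) / 2 * d2
    return math.sin(x0 + s * h) - p

steps = [0.1, 0.05, 0.025]
errors = [quadratic_error(h) for h in steps]
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print(errors)
print(ratios)
```

The ratios are close to 8 rather than exactly 8 because the value of the third derivative at the unknown point also shifts slightly as h changes.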


3.7 DERIVATION OF FORMULAS BY SYMBOLIC METHODS 


In several instances we have presented formulas without proof, although we have dem- 
onstrated their validity in specific cases. Symbolic operator methods are a convenient 
way to establish these relations. We supplement the forward-differencing operator $\Delta$ with
a backward-differencing operator $\nabla$ and a stepping operator $E$. These are defined by their
actions on a function:

$$\begin{aligned}
\Delta f(x_0) &= f(x_0 + h) - f(x_0), \\
\Delta^2 f(x_0) &= \Delta[\Delta f(x_0)] = \Delta f(x_0 + h) - \Delta f(x_0), \\
\nabla f(x_0) &= f(x_0) - f(x_0 - h), \\
\nabla^2 f(x_0) &= \nabla[\nabla f(x_0)] = \nabla f(x_0) - \nabla f(x_0 - h), \\
E f(x_0) &= f(x_0 + h), \\
E^2 f(x_0) &= E[E f(x_0)] = E f(x_0 + h) = f(x_0 + 2h), \\
E^n f(x_0) &= f(x_0 + nh).
\end{aligned} \qquad (3.11)$$


These obvious relationships exist:

$$\begin{aligned}
\Delta f(x_0) &= E f(x_0) - f(x_0) = (E - 1) f(x_0), \\
\nabla f(x_0) &= f(x_0) - E^{-1} f(x_0) = (1 - E^{-1}) f(x_0).
\end{aligned} \qquad (3.12)$$

We abstract from Eq. (3.12) the symbolic operator relations:

$$\Delta = E - 1, \qquad \nabla = 1 - E^{-1}. \qquad (3.13)$$
Equations (3.13) are really meaningless, for neither side is defined as it stands. What
they signify is that the effect of $\Delta$ when operating on a function is the same as the effect
of operating with $(E - 1)$, and that $\nabla$ and $(1 - E^{-1})$ have the same effect on a function,
though all these quantities are without significance standing alone. We must apply them
to a function to interpret them. This is no different, really, from other operational symbols
that we use, such as $\sqrt{\ }$ for square root and $d/dx$ for differentiation.
Since all the operators represented by $\Delta$, $\nabla$, and $E$ are linear operators (the effect on
a linear combination of functions is the same as the linear combination of the operator
acting on the functions), the laws of algebra are obeyed in relationships between them.
This means we can manipulate the relations of Eq. (3.13) by algebraic transformations,
and then interpret the results by letting them operate on a function. For example, by
raising $\Delta = E - 1$ to the nth power, we have

$$\Delta^n = (E - 1)^n = E^n - nE^{n-1} + \binom{n}{2}E^{n-2} - \cdots,$$

so

$$\Delta^n f(x_0) = \left[E^n - nE^{n-1} + \binom{n}{2}E^{n-2} - \cdots\right] f(x_0)
  = f(x_0 + nh) - n f[x_0 + (n-1)h] + \binom{n}{2} f[x_0 + (n-2)h] - \cdots,$$

or

$$\Delta^n f_0 = f_n - n f_{n-1} + \binom{n}{2} f_{n-2} - \cdots. \qquad (3.14)$$


Equation (3.14) is a proof of the alternating-sign, binomial-coefficient formula given in 
Eq. (3.5). 
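
Equation (3.14) can be checked against a difference table directly. The Python sketch below is illustrative only, and its function names are introduced here. It computes $\Delta^n f_0$ for the f-values of Table 3.6 in two ways: by differencing the column n times, and by the alternating-sign binomial sum.

```python
from math import comb

f = [1.06894, 1.18136, 1.30561, 1.44292, 1.59467, 1.76238]  # f-values of Table 3.6

def nth_difference_by_table(values, n):
    """n-th forward difference at the first entry, by repeated differencing."""
    column = list(values)
    for _ in range(n):
        column = [b - a for a, b in zip(column, column[1:])]
    return column[0]

def nth_difference_by_formula(values, n):
    """Same quantity from Eq. (3.14): f_n - n f_{n-1} + C(n,2) f_{n-2} - ..."""
    return sum((-1) ** k * comb(n, k) * values[n - k] for k in range(n + 1))

for n in range(1, 5):
    print(n, nth_difference_by_table(f, n), nth_difference_by_formula(f, n))
```

The two columns of output should agree to round-off, and the n = 3 row should match the 0.00123 entry of Table 3.6.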
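We can develop interesting relations between $\Delta$ and $\nabla$, such as

$$\begin{aligned}
E\nabla &= E(1 - E^{-1}) = E - 1 = \Delta, \\
E^n \nabla^n &= \nabla^n E^n = \Delta^n, \\
\Delta^n f_0 &= \nabla^n E^n f_0 = \nabla^n f_n.
\end{aligned}$$

This illustrates the fact that a given difference entry in a table can be interpreted as
either a forward or backward difference of the appropriate f-values. (This was already
obvious from inspection of a difference table.) We can very simply derive the Newton-Gregory
forward formula:

$$E = 1 + \Delta, \qquad E^s = (1 + \Delta)^s,$$

$$\begin{aligned}
f_s = E^s f_0 = (1 + \Delta)^s f_0 &= \left[1 + s\Delta + \binom{s}{2}\Delta^2 + \binom{s}{3}\Delta^3 + \cdots\right] f_0 \\
  &= f_0 + s\,\Delta f_0 + \binom{s}{2}\Delta^2 f_0 + \binom{s}{3}\Delta^3 f_0 + \cdots.
\end{aligned}$$

The Newton-Gregory backward formula is similarly easy:

$$E^{-1} = 1 - \nabla, \qquad E^s = (1 - \nabla)^{-s},$$

$$\begin{aligned}
f_s = E^s f_0 = (1 - \nabla)^{-s} f_0 &= \left[1 + s\nabla + \binom{s+1}{2}\nabla^2 + \binom{s+2}{3}\nabla^3 + \cdots\right] f_0 \\
  &= f_0 + s\,\nabla f_0 + \binom{s+1}{2}\nabla^2 f_0 + \binom{s+2}{3}\nabla^3 f_0 + \cdots \\
  &= f_0 + s\,\Delta f_{-1} + \binom{s+1}{2}\Delta^2 f_{-2} + \binom{s+2}{3}\Delta^3 f_{-3} + \cdots.
\end{aligned} \qquad (3.15)$$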


We will use symbolic methods to derive derivative formulas in the next chapter. 


3.8 INVERSE INTERPOLATION


Suppose we have a table of data such as Table 3.7 and we are required to find the x- 
value corresponding to a certain value of the function, say at y = 5.0. There are two 
approaches we could use. We can consider the y’s to be the independent variable (unevenly 
spaced) and interpolate for x with a divided-difference polynomial. Doing so gives x = 
2.312. This approach is straightforward but in some instances gives poor results, the 
reason being that x considered as a function of y may not be well approximated by a 
polynomial. This may be true even though y itself behaves very much like a polynomial. 


Table 3.7

 x_i     f_i     f[x_i, x_{i+1}]   f[x_i, ..., x_{i+2}]   f[x_i, ..., x_{i+3}]   f[x_i, ..., x_{i+4}]
 1.6   2.3756
                     2.9753
 1.9   3.2682                            1.5863
                     3.7685                                    0.6317
 2.1   4.0219                            2.0917                                      0.1823
                     4.8143                                    0.8504
 2.4   5.4662                            2.8570
                     6.8142
 2.8   8.1919


(Try inverse interpolation among three or four points of the function $y = x^2$, especially
for y-values outside the given range.)

The second method is to write y as a polynomial in x and then use root-finding methods
to solve for x. From Table 3.7,

$$\begin{aligned}
P_4(x) = 2.3756 &+ 2.9753(x - 1.6) \\
 &+ 1.5863(x - 1.6)(x - 1.9) \\
 &+ 0.6317(x - 1.6)(x - 1.9)(x - 2.1) \\
 &+ 0.1823(x - 1.6)(x - 1.9)(x - 2.1)(x - 2.4).
\end{aligned} \qquad (3.16)$$

To complete the problem, we must find the root near 2.3 of this fourth-degree polynomial:

$$f(x) = 5.0 - P_4(x) = 0.$$

The equivalent of this second technique for inverse interpolation can be accomplished
more readily by a method of successive approximation. Our divided-difference polynomial
is, in its general form,

$$y = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1)
  + f[x_0, x_1, x_2, x_3](x - x_0)(x - x_1)(x - x_2) + \cdots. \qquad (3.17a)$$

Rearrange to solve for x in the second term:

$$x = \frac{y - f[x_0] - f[x_0, x_1, x_2](x - x_0)(x - x_1) - \cdots}{f[x_0, x_1]} + x_0. \qquad (3.17b)$$


The method of successive approximation finds x by first neglecting all the terms in x on
the right. Taking $x_0 = 2.1$, we get

$$x_1 = \frac{5.0 - 4.0219}{4.8143} + 2.1 = 2.3032.$$


The second approximation is obtained using $x_1$ on the right side of Eq. (3.17b), including
now one more term. Taking $x_0 = 1.9$, we get

$$x_2 = \frac{5.0 - 3.2682 - 2.0917(x_1 - 1.9)(x_1 - 2.1)}{3.7685} + 1.9 = 2.3141.$$

To get $x_3$, we use $x_2$ on the right-hand side and pick up another term:

$$x_3 = \frac{5.0 - 3.2682 - 2.0917(x_2 - 1.9)(x_2 - 2.1) - 0.8504(x_2 - 1.9)(x_2 - 2.1)(x_2 - 2.4)}{3.7685} + 1.9 = 2.3121.$$

In the same fashion, taking $x_0 = 1.6$, we get

$$x_4 = \frac{\text{Numerator}}{2.9753} + 1.6 = 2.3127,$$

where

$$\begin{aligned}
\text{Numerator} = 5.0 - 2.3756 &- 1.5863(x_3 - 1.6)(x_3 - 1.9) \\
 &- 0.6317(x_3 - 1.6)(x_3 - 1.9)(x_3 - 2.1) \\
 &- 0.1823(x_3 - 1.6)(x_3 - 1.9)(x_3 - 2.1)(x_3 - 2.4).
\end{aligned}$$


Do you see why the values for $x_0$ were changed? It was to use differences that are
best centered on y = 5.0.

The data in Table 3.7 actually are for y = sinh x. Substitution in the hyperbolic sine 
function gives sinh (2.3127) = 5.0013. Our computation is off by 0.0013. 
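
The successive-approximation scheme of Eq. (3.17b) can be sketched in a few lines of Python. This is an illustrative sketch rather than the book's program: the function names `divided_differences` and `refine` are invented here, and the divided differences are recomputed from the tabulated x- and f-values instead of being copied from the rounded table entries, so the last digits of intermediate values may differ slightly from hand computation.

```python
import math

X = [1.6, 1.9, 2.1, 2.4, 2.8]
Y = [2.3756, 3.2682, 4.0219, 5.4662, 8.1919]   # tabulated values of sinh x

def divided_differences(xs, ys):
    """Return f[x0], f[x0,x1], f[x0,x1,x2], ... for the given nodes."""
    coef = list(ys)
    for j in range(1, len(xs)):
        for i in range(len(xs) - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    return coef

def refine(xs, ys, y_target, x_prev, n_terms):
    """One pass of Eq. (3.17b): n_terms higher-order terms use the previous x."""
    c = divided_differences(xs, ys)
    numerator = y_target - c[0]
    product = x_prev - xs[0]
    for k in range(2, 2 + n_terms):
        product *= x_prev - xs[k - 1]
        numerator -= c[k] * product
    return numerator / c[1] + xs[0]

y_target = 5.0
x1 = refine(X[2:], Y[2:], y_target, x_prev=0.0, n_terms=0)  # start from x0 = 2.1
x2 = refine(X[1:], Y[1:], y_target, x1, n_terms=1)          # start from x0 = 1.9
x3 = refine(X[1:], Y[1:], y_target, x2, n_terms=2)
x4 = refine(X, Y, y_target, x3, n_terms=3)                  # start from x0 = 1.6
print(x1, x2, x3, x4, math.sinh(x4))
```

Changing the slice of the table passed to `refine` is what shifts the starting node, mirroring the change of $x_0$ between steps.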


3.9 INTERPOLATING WITH A CUBIC SPLINE


The fitting of a polynomial curve to a set of data may be considered from another point 
of view, that of the draftsman. We wish to fit a “smooth curve” to the points. One could 
use a French curve on the drafting table, but this is a very subjective operation. Fitting 
a polynomial of high degree to a set of six or eight points, say, does not appeal to us, 
since we do not expect that the functional relationship is that complicated. 

Not only do high-degree polynomials seem undesirable on the basis of our intuition; 
in some cases they show unexpectedly large deviations from a smooth curve through the 
data. As an extreme case, in which this effect is exaggerated, consider this function: 


$$\begin{aligned}
f(x) &= 0, & -1.0 \le x \le -0.2; \\
f(x) &= 1 - |5x|, & -0.2 < x < 0.2; \\
f(x) &= 0, & 0.2 \le x \le 1.0.
\end{aligned}$$


Suppose we fit exact polynomials of degree 2, 4, 6, and 8, all agreeing with f(x)
over the interval $-1 \le x \le 1$. Figure 3.1 shows the results. The wide swings when the
degree is high, which are caused by a local “bump” in the function at x = 0 but which
occur at points far removed from x = 0, are strong arguments against using high-degree
polynomials in cases such as this, when a function is generally smooth but has some
local roughness. (The roughness might be only apparent, being caused by an abnormally
large error in one of the points.)

One solution to this problem is fitting subregions of the region $-1 \le x \le 1$ with
low-degree polynomials. Figure 3.2 illustrates how quadratics would accomplish this.
The difficulty with this approach is that the slope is discontinuous at the points where
the quadratics join. If we have a generally smooth function, this is undesirable. We seek
a method that retains smoothness where the function is smooth, but still can fit local
irregularities without the violent misbehavior exhibited in our example above.


Figure 3.2


One technique that is becoming increasingly important is the so-called spline fitting
of a curve. The name derives from another draftsman's device. A spline is a flexible strip
that can be held by weights so that it passes through each of the given points but goes 
smoothly from each interval to the next according to the laws of beam flexure. The present 
mathematical procedure is an adaptation of this idea. It is particularly advantageous when 
we want to find derivatives of the data. 

The conditions for a cubic spline fit are that we pass a set of cubics through the 
points, using a new cubic in each interval. To correspond to the idea of the draftsman’s 
spline, we require that both the slope and the curvature be the same for the pair of cubics 
that join at each point. We now develop the equations subject to these conditions. 

Write the cubic for the ith interval, which lies between the points $(x_i, y_i)$ and
$(x_{i+1}, y_{i+1})$, in the form

$$y = a_i(x - x_i)^3 + b_i(x - x_i)^2 + c_i(x - x_i) + d_i. \qquad (3.18)$$


Since it fits at the two endpoints of the interval,

$$y_i = a_i(x_i - x_i)^3 + b_i(x_i - x_i)^2 + c_i(x_i - x_i) + d_i = d_i; \qquad (3.19)$$

$$\begin{aligned}
y_{i+1} &= a_i(x_{i+1} - x_i)^3 + b_i(x_{i+1} - x_i)^2 + c_i(x_{i+1} - x_i) + d_i \\
        &= a_i h_i^3 + b_i h_i^2 + c_i h_i + d_i.
\end{aligned} \qquad (3.20)$$


In the last equation, we use $h_i$ for $\Delta x$ in the ith interval. We need the first and second
derivatives to relate the slopes and curvatures of the joining polynomials, so we
differentiate Eq. (3.18):

$$y' = 3a_i(x - x_i)^2 + 2b_i(x - x_i) + c_i, \qquad (3.21)$$

$$y'' = 6a_i(x - x_i) + 2b_i. \qquad (3.22)$$


The mathematical procedure is simplified if we write the equations in terms of the
second derivatives of the interpolating cubics. Let $S_i$ represent the second derivative at
the point $(x_i, y_i)$ and $S_{i+1}$ at the point $(x_{i+1}, y_{i+1})$.
From Eq. (3.22) we have

$$\begin{aligned}
S_i &= 6a_i(x_i - x_i) + 2b_i = 2b_i, \\
S_{i+1} &= 6a_i(x_{i+1} - x_i) + 2b_i = 6a_i h_i + 2b_i.
\end{aligned}$$

Hence we can write

$$b_i = \frac{S_i}{2}, \qquad (3.23)$$

$$a_i = \frac{S_{i+1} - S_i}{6h_i}. \qquad (3.24)$$


We substitute the relations for $a_i$, $b_i$, $d_i$ given by Eqs. (3.19), (3.23), and (3.24) into
Eq. (3.20) and then solve for $c_i$:

$$y_{i+1} = \left(\frac{S_{i+1} - S_i}{6h_i}\right) h_i^3 + \frac{S_i}{2} h_i^2 + c_i h_i + y_i,$$

$$c_i = \frac{y_{i+1} - y_i}{h_i} - \frac{2h_i S_i + h_i S_{i+1}}{6}.$$


We now invoke the condition that the slopes of the two cubics that join at $(x_i, y_i)$
are the same. For the equation in the ith interval, Eq. (3.21) becomes, with $x = x_i$,

$$y_i' = 3a_i(x_i - x_i)^2 + 2b_i(x_i - x_i) + c_i = c_i.$$

In the previous interval, from $x_{i-1}$ to $x_i$, the slope at its right end will be

$$\begin{aligned}
y_i' &= 3a_{i-1}(x_i - x_{i-1})^2 + 2b_{i-1}(x_i - x_{i-1}) + c_{i-1} \\
     &= 3a_{i-1}h_{i-1}^2 + 2b_{i-1}h_{i-1} + c_{i-1}.
\end{aligned}$$


Equating these, and substituting for a, b, c, d their relationships in terms of S and y, we
get

$$\begin{aligned}
y_i' &= \frac{y_{i+1} - y_i}{h_i} - \frac{2h_i S_i + h_i S_{i+1}}{6} \\
     &= 3\left(\frac{S_i - S_{i-1}}{6h_{i-1}}\right)h_{i-1}^2 + 2\left(\frac{S_{i-1}}{2}\right)h_{i-1}
        + \frac{y_i - y_{i-1}}{h_{i-1}} - \frac{2h_{i-1}S_{i-1} + h_{i-1}S_i}{6}.
\end{aligned}$$


On simplifying this equation we get

$$\begin{aligned}
h_{i-1}S_{i-1} + (2h_{i-1} + 2h_i)S_i + h_i S_{i+1}
  &= 6\left(\frac{y_{i+1} - y_i}{h_i} - \frac{y_i - y_{i-1}}{h_{i-1}}\right) \\
  &= 6\bigl(f[x_i, x_{i+1}] - f[x_{i-1}, x_i]\bigr).
\end{aligned} \qquad (3.25)$$


Equation (3.25) applies at each internal point, from i = 1 to i = n - 1, there being
n + 1 total points. This gives n - 1 equations relating the n + 1 values of $S_i$. We get
two additional equations involving $S_0$ and $S_n$ when we specify conditions pertaining to
the end intervals of the whole curve. To some extent, these end conditions are arbitrary.
Four* alternative choices are often used:


1. Take $S_0 = 0$, $S_n = 0$. This is equivalent to assuming that the end cubics
   approach linearity at their extremities.

2. Take $S_0 = S_1$, $S_n = S_{n-1}$. This is equivalent to assuming that the end
   cubics approach parabolas at their extremities.

3. Take $S_0$ as a linear extrapolation from $S_1$ and $S_2$, and $S_n$ as a linear
   extrapolation from $S_{n-1}$ and $S_{n-2}$. With this assumption, for a set of
   data that are fit by a single cubic equation, their cubic splines will all
   be this same cubic. The other conditions do not have this property.

   The relations for end condition 3 are as follows:

   At the left end:

   $$\frac{S_1 - S_0}{h_0} = \frac{S_2 - S_1}{h_1}, \qquad S_0 = \frac{(h_0 + h_1)S_1 - h_0 S_2}{h_1}.$$

   At the right end:

   $$\frac{S_n - S_{n-1}}{h_{n-1}} = \frac{S_{n-1} - S_{n-2}}{h_{n-2}}, \qquad
     S_n = \frac{(h_{n-2} + h_{n-1})S_{n-1} - h_{n-1}S_{n-2}}{h_{n-2}}. \qquad (3.26)$$

4. Force the slopes at each end to assume certain values. We may have
   to make estimates of these slopes.

   The relations for end condition 4 are (with $f'(x_0) = A$ and $f'(x_n) = B$):

   At the left end:

   $$2h_0 S_0 + h_0 S_1 = 6\bigl(f[x_0, x_1] - A\bigr).$$

   At the right end:

   $$h_{n-1}S_{n-1} + 2h_{n-1}S_n = 6\bigl(B - f[x_{n-1}, x_n]\bigr).$$


Observe that we use divided differences here. 


Relation 1, where $S_0 = 0$ and $S_n = 0$, is called a natural spline. It is often felt that
this flattens the curve too much at the ends; in spite of this, it is frequently used. Relation
3 frequently suffers from the other extreme, giving too much curvature in the end intervals.
Probably the best end condition to use is condition 4, provided reasonable estimates of
the derivative are available.


*A fifth condition is sometimes encountered: a function is periodic and the data cover a full period. In this
case, $S_0 = S_n$ and the slopes are also the same at the first and last points.


If we write the equation for $S_1, S_2, \ldots, S_{n-1}$ (Eq. (3.25)) in matrix form, we get

$$\begin{bmatrix}
h_0 & 2(h_0 + h_1) & h_1 & & & & \\
 & h_1 & 2(h_1 + h_2) & h_2 & & & \\
 & & h_2 & 2(h_2 + h_3) & h_3 & & \\
 & & & \ddots & \ddots & \ddots & \\
 & & & & h_{n-2} & 2(h_{n-2} + h_{n-1}) & h_{n-1}
\end{bmatrix}
\begin{bmatrix} S_0 \\ S_1 \\ S_2 \\ S_3 \\ \vdots \\ S_{n-1} \\ S_n \end{bmatrix}
= 6 \begin{bmatrix}
f[x_1, x_2] - f[x_0, x_1] \\
f[x_2, x_3] - f[x_1, x_2] \\
f[x_3, x_4] - f[x_2, x_3] \\
\vdots \\
f[x_{n-1}, x_n] - f[x_{n-2}, x_{n-1}]
\end{bmatrix}. \qquad (3.27)$$


In the matrix array above, there are only n - 1 equations, but n + 1 unknowns. We
can eliminate two unknowns ($S_0$ and $S_n$) using the above relations that correspond to the
end-condition assumptions. In the first three cases, this reduces the S vector to n - 1
elements, and the coefficient matrix becomes square, of size (n - 1) × (n - 1). Furthermore,
the matrix is always tridiagonal (even in case 4), and hence is solved speedily and can
be stored economically. A program is given at the end of this chapter that creates this
tridiagonal matrix and augments it with the proper right-hand-side vector, then solves for
the values of S in each interval. You will remember that S is the second derivative of the
cubic in each interval, so there is more work to do to get the actual cubic that makes the
interpolating curve.

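For each end condition, the coefficient matrices become


Condition 1  $S_0 = 0$, $S_n = 0$:

$$\begin{bmatrix}
2(h_0 + h_1) & h_1 & & & \\
h_1 & 2(h_1 + h_2) & h_2 & & \\
 & h_2 & 2(h_2 + h_3) & h_3 & \\
 & & \ddots & \ddots & \ddots \\
 & & & h_{n-2} & 2(h_{n-2} + h_{n-1})
\end{bmatrix}$$

Condition 2  $S_0 = S_1$, $S_n = S_{n-1}$:

$$\begin{bmatrix}
(3h_0 + 2h_1) & h_1 & & & \\
h_1 & 2(h_1 + h_2) & h_2 & & \\
 & h_2 & 2(h_2 + h_3) & h_3 & \\
 & & \ddots & \ddots & \ddots \\
 & & & h_{n-2} & (2h_{n-2} + 3h_{n-1})
\end{bmatrix}$$

Condition 3  $S_0$ and $S_n$ are linear extrapolations:

$$\begin{bmatrix}
\dfrac{(h_0 + h_1)(h_0 + 2h_1)}{h_1} & \dfrac{h_1^2 - h_0^2}{h_1} & & & \\
h_1 & 2(h_1 + h_2) & h_2 & & \\
 & h_2 & 2(h_2 + h_3) & h_3 & \\
 & & \ddots & \ddots & \ddots \\
 & & & \dfrac{h_{n-2}^2 - h_{n-1}^2}{h_{n-2}} & \dfrac{(h_{n-1} + h_{n-2})(h_{n-1} + 2h_{n-2})}{h_{n-2}}
\end{bmatrix}$$

Condition 4  $f'(x_0) = A$ and $f'(x_n) = B$:

$$\begin{bmatrix}
2h_0 & h_0 & & & \\
h_0 & 2(h_0 + h_1) & h_1 & & \\
 & h_1 & 2(h_1 + h_2) & h_2 & \\
 & & \ddots & \ddots & \ddots \\
 & & & h_{n-1} & 2h_{n-1}
\end{bmatrix}$$


With condition 3, after solving the set of equations, we must compute $S_0$ and $S_n$
using Eq. (3.26). For conditions 1, 2, and 4, no computations are needed. For each of
the first three cases, the right-hand-side vector is the same; it is given in Eq. (3.27). If
the data are evenly spaced, the matrices reduce to a simple form.

After the $S_i$ values are obtained, we get the coefficients $a_i$, $b_i$, $c_i$, and $d_i$ for the
cubics in each interval. From these we can compute points on the interpolating curve.

$$\begin{aligned}
a_i &= \frac{S_{i+1} - S_i}{6h_i}, \\
b_i &= \frac{S_i}{2}, \\
c_i &= \frac{y_{i+1} - y_i}{h_i} - \frac{2h_i S_i + h_i S_{i+1}}{6}, \\
d_i &= y_i.
\end{aligned}$$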

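EXAMPLE  Fit the data of Table 3.8 by a cubic spline curve. (The true relation is just $y = x^3 - 8$.)
Use all four end conditions.


Table 3.8

  x     y
  0    -8
  1    -7
  2     0
  3    19
  4    56

For end condition 1, we solve

$$\begin{bmatrix} 4 & 1 & 0 \\ 1 & 4 & 1 \\ 0 & 1 & 4 \end{bmatrix}
\begin{bmatrix} S_1 \\ S_2 \\ S_3 \end{bmatrix} =
\begin{bmatrix} 36 \\ 72 \\ 108 \end{bmatrix},$$

so that

$$S_0 = 0, \quad S_1 = 6.4285, \quad S_2 = 10.2857, \quad S_3 = 24.4285, \quad S_4 = 0.$$

For end condition 2, we solve

$$\begin{bmatrix} 5 & 1 & 0 \\ 1 & 4 & 1 \\ 0 & 1 & 5 \end{bmatrix}
\begin{bmatrix} S_1 \\ S_2 \\ S_3 \end{bmatrix} =
\begin{bmatrix} 36 \\ 72 \\ 108 \end{bmatrix},$$

so that

$$S_0 = 4.8, \quad S_1 = 4.8, \quad S_2 = 12.0, \quad S_3 = 19.2, \quad S_4 = 19.2.$$

For end condition 3, we solve

$$\begin{bmatrix} 6 & 0 & 0 \\ 1 & 4 & 1 \\ 0 & 0 & 6 \end{bmatrix}
\begin{bmatrix} S_1 \\ S_2 \\ S_3 \end{bmatrix} =
\begin{bmatrix} 36 \\ 72 \\ 108 \end{bmatrix},$$

so that

$$S_0 = 0.0, \quad S_1 = 6.0, \quad S_2 = 12.0, \quad S_3 = 18.0, \quad S_4 = 24.0.$$

For end condition 4 (with end slopes A = 0 and B = 48, the values of $y' = 3x^2$ at x = 0 and x = 4), we solve

$$\begin{bmatrix}
2 & 1 & 0 & 0 & 0 \\
1 & 4 & 1 & 0 & 0 \\
0 & 1 & 4 & 1 & 0 \\
0 & 0 & 1 & 4 & 1 \\
0 & 0 & 0 & 1 & 2
\end{bmatrix}
\begin{bmatrix} S_0 \\ S_1 \\ S_2 \\ S_3 \\ S_4 \end{bmatrix} =
\begin{bmatrix} 6 \\ 36 \\ 72 \\ 108 \\ 66 \end{bmatrix},$$

so that

$$S_0 = 0.0, \quad S_1 = 6.0, \quad S_2 = 12.0, \quad S_3 = 18.0, \quad S_4 = 24.0.$$

The equation of the data, $y = x^3 - 8$, gives second-derivative values that match
those obtained by using conditions 3 and 4.
The next example is a more realistic situation in which a cubic spline might be used.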


EXAMPLE  The data in the following table are from astronomical observations of a type of variable
star called a Cepheid variable, and represent variations in its apparent magnitude with
time:


Time                  0.0    0.2    0.3    0.4    0.5    0.6    0.7    0.8    1.0
Apparent magnitude    0.302  0.185  0.106  0.093  0.240  0.579  0.561  0.468  0.302


Use each of the four end conditions to compute cubic splines, and compare the values
interpolated from each spline function at intervals of time of 0.05.

The augmented matrices whose solutions give values for $S_1, S_2, \ldots, S_7$ are shown
in Table 3.9(a). The computer program at the end of the chapter was used, giving the
results shown in Table 3.9(b).


Table 3.9(a)

Condition 1
Matrix coefficients are

   0.20    0.60    0.10     -1.23
   0.10    0.40    0.10      3.96
   0.10    0.40    0.10      9.60
   0.10    0.40    0.10     11.52
   0.10    0.40    0.10    -21.42
   0.10    0.40    0.10     -4.50
   0.10    0.60    0.20      0.60

Condition 2
Matrix coefficients are

   0.20    0.80    0.10     -1.23
   0.10    0.40    0.10      3.96
   0.10    0.40    0.10      9.60
   0.10    0.40    0.10     11.52
   0.10    0.40    0.10    -21.42
   0.10    0.40    0.10     -4.50
   0.10    0.80    0.20      0.60

Condition 3
Matrix coefficients are

   0.20    1.20   -0.30     -1.23
   0.10    0.40    0.10      3.96
   0.10    0.40    0.10      9.60
   0.10    0.40    0.10     11.52
   0.10    0.40    0.10    -21.42
   0.10    0.40    0.10     -4.50
  -0.30    1.20    0.20      0.60

Condition 4
Matrix coefficients are

   1.00    0.40    0.10      0.00
   0.20    0.60    0.10     -1.23
   0.10    0.40    0.10      3.96
   0.10    0.40    0.10      9.60
   0.10    0.40    0.10     11.52
   0.10    0.40    0.10    -21.42
   0.10    0.40    0.10     -4.50
   0.10    0.60    0.20      0.60
   0.10    0.40    0.00      0.00


Table 3.9(b)

  t     Values,       Values,       Values,       Values,
        condition 1   condition 2   condition 3   condition 4*

0.00 0.302 0.302 0.302 0.302 
0.05 0.278 0.282 0.297 0.276 
0.10 0.252 0.256 0.271 0.250 
0.15 0.222 0.224 0.231 0.221 
0.20 0.185 0.185 0.185 0.185 
0.25 0.143 0.142 0.141 0.143 
0.30 0.106 0.106 0.106 0.106 
0.35 0.087 0.088 0.088 0.087 
0.40 0.093 0.093 0.093 0.093 
0.45 0.133 0.133 0.133 0.133 
0.50 0.240 0.240 0.240 0.240 
0.55 0.424 0.424 0.424 0.424 
0.60 0.579 0.579 0.579 0.579 
0.65 0.608 0.608 0.608 0.608 
0.70 0.561 0.561 0.561 0.561 
0.75 0.511 0.511 0.511 0.511 
0.80 0.468 0.468 0.468 0.468 
0.85 0.426 0.426 0.430 0.426 
0.90 0.385 0.384 0.392 0.385 
0.95 0.343 0.343 0.350 0.343 
1.00 0.302 0.302 0.302 0.302 


*Note that in the values for condition 4, we used forward and backward differences to approximate the slope at either end of
the curve; that is, y'(0.0) = -0.585 and y'(1.0) = -0.830.


A graph of the four solutions is shown in Figure 3.3. The points all are so close to
each other that we must magnify the portions near the ends to see the differences. In the
central part of the curve, between t = 0.2 and 0.8, none differ by more than 0.001.


Figure 3.3


OTHER SPLINES


In addition to the splines we have studied in the previous section, there are others that 
are important. In particular, Bezier curves and B-splines are widely used in computer 
graphics and computer-aided design. These two types of curves are not really interpolating 
splines, since the curves do not normally pass through all of the points. In this respect 
they show some similarity to least-squares curves that are discussed in a later chapter. 
However, both Bezier curves and B-splines have the important property of staying within 
the polygon determined by the given points. We will be more explicit about this property 
later. In addition, these two new spline curves have a nice geometric property in that in 
changing one of the points we change only one portion of the curve, a “local” effect. 
For the cubic spline curve of the previous section, by changing just one point we would 
have a “global” effect, in that the curve from the first to the last point would be affected. 
Finally, for the cubic splines just studied the points were given data points. For the two 
curves we study in this section the points in question are more likely “control” points we 
select to determine the shape of the curve we are working on. 

For simplicity, we consider mainly the cubic version of these two curves. In what
follows, we will express y = f(x) in parametric form. The parametric form represents a
relation between x and y by two other equations, $x = F_1(u)$, $y = F_2(u)$. The independent
variable u is called the parameter. For example, the equation for a circle can be written,
with $\theta$ as the parameter, as

$$x = r\cos(\theta), \qquad y = r\sin(\theta).$$

When y and x are expressed in terms of a parameter u, $(x(u), y(u))$, $0 \le u \le 1$, defines
a set of points (x, y), associated with the values of u. We discuss Bezier curves before
B-splines. Bezier curves are named after the French engineer, P. Bezier of the Renault 
Automobile Company. He developed them in the early 1960s to fill a need for curves 
whose shape can be readily controlled by changing a few parameters. Bezier’s application 
was to construct pleasing surfaces for car bodies. 

Suppose we are given a set of control points, pj = (x), y,),i=0,1,..., n. (These 
points are also referred to as Bezier points.) Figure 3.4 is an example. 
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Figure 3.4 


(Xs Yo) 

: ey) 
, (Yn) 
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These points could be chosen on a computer screen either by a mouse or a light pen. 
The points do not necessarily progress from left to right. We treat the coordinates of each 
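point as a two-component vector,

$$p_i = \begin{pmatrix} x_i \\ y_i \end{pmatrix}.$$

The set of points, in parametric form, is

$$P(u) = \begin{pmatrix} x(u) \\ y(u) \end{pmatrix}, \qquad 0 \le u \le 1.$$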


The nth-degree Bezier polynomial determined by n + 1 points is given by

$$P(u) = \sum_{i=0}^{n} \binom{n}{i} (1 - u)^{n-i} u^{i} p_i, \qquad \text{where} \quad \binom{n}{i} = \frac{n!}{i!\,(n - i)!}.$$


(The above really represents two scalar equations, one for x and the other for y.)
For n = 2, this would give the quadratic equation defined by three points, $p_0$, $p_1$, $p_2$:

$$P(u) = (1)(1 - u)^2 p_0 + 2(1 - u)(u) p_1 + (1) u^2 p_2,$$

since, for n = 2 and i = 0, 1, 2, we have $\binom{2}{0} = 1$, $\binom{2}{1} = 2$, $\binom{2}{2} = 1$. The above equation
represents the pair of equations

$$x(u) = (1 - u)^2 x_0 + 2(1 - u)(u) x_1 + u^2 x_2,$$
$$y(u) = (1 - u)^2 y_0 + 2(1 - u)(u) y_1 + u^2 y_2.$$


Figure 3.5(a) 
Observe that, if u = 0, x(0) is identical to $x_0$ and similarly for y(0). If u = 1, the point
referred to is $(x_2, y_2)$. As u takes on values between 0 and 1, a curve is traced that goes
from the first point to the third of the set. Ordinarily the curve will not pass through the 
central point of the three. (If they are collinear, the curve is the straight line through 
them all.) In effect, the points of the second-degree Bezier curve have coordinates that 
are weighted sums of the coordinates of the three points that are used to define it. From 
another point of view, one can think of the Bezier equations as weighted sums of three 
polynomials in u, where the weighting factors are the coordinates of the three points. 
Applying the general defining equation for n = 3, we get the cubic Bezier polynomial 
that we shall consider in some detail. The properties of other Bezier polynomials are the 
same as for the cubic. Here is the Bezier cubic: 


$$x(u) = (1 - u)^3 x_0 + 3(1 - u)^2 u\, x_1 + 3(1 - u) u^2 x_2 + u^3 x_3,$$
$$y(u) = (1 - u)^3 y_0 + 3(1 - u)^2 u\, y_1 + 3(1 - u) u^2 y_2 + u^3 y_3.$$


Observe again that (x(0), y(0)) = po and (x(1), y(1)) = p3, and that the curve will 
not ordinarily go through the intermediate points. As illustrated in the example curves 
below, changing the intermediate “control” points changes the shape of the curve. The 
examples are in Figures 3.5(a) through 3.5(e). The first three of these show Bezier curves 
defined by one group of four points. 
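
The general defining equation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bezier_point` and the four control points are hypothetical and are not the points plotted in Figure 3.5.

```python
from math import comb

def bezier_point(control_points, u):
    """Evaluate the nth-degree Bezier curve at parameter u, 0 <= u <= 1.

    control_points is a list of n + 1 (x, y) pairs; each coordinate is the
    weighted sum  sum_i C(n, i) (1 - u)^(n - i) u^i p_i.
    """
    n = len(control_points) - 1
    x = y = 0.0
    for i, (xi, yi) in enumerate(control_points):
        weight = comb(n, i) * (1.0 - u) ** (n - i) * u ** i
        x += weight * xi
        y += weight * yi
    return x, y

# Hypothetical control points, chosen only for illustration.
pts = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
for k in range(5):
    u = k / 4
    print(u, bezier_point(pts, u))
```

Moving one of the two intermediate points in `pts` and rerunning the loop shows how the shape changes while the curve still starts at the first point and ends at the last.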

Figures 3.5(d) and 3.5(e) demonstrate how cubic Bezier curves can be continued
beyond the first set of four points; one just subdivides seven points ($p_0$ to $p_6$) into two
groups of four, with the central one ($p_3$) belonging to both sets. Figure 3.5(e) shows
that $p_2$, $p_3$, and $p_4$ must be collinear to avoid a discontinuity in the slope at $p_3$.


Figure 3.5(b)

Figure 3.5(c)

Figure 3.5(d) Points $p_2$, $p_3$, $p_4$ are not collinear.

Figure 3.5(e) Points $p_2$, $p_3$, $p_4$ are collinear.

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It is of interest to list the properties of Bezier cubics: 


1. $P(0) = p_0$, $P(1) = p_3$.

2. Since $dx/du = 3(x_1 - x_0)$ and $dy/du = 3(y_1 - y_0)$ at u = 0, the slope
of the curve at u = 0 is $dy/dx = (y_1 - y_0)/(x_1 - x_0)$, which is the
slope of the secant line between $p_0$ and $p_1$. Similarly, the slope at u = 1
is the same as that of the secant line between the last two points. In the
figures, this is indicated by dashed lines.

3. The Bezier curve is contained in the convex hull determined by the four
points.


The convex hull of a set of points is the smallest convex set containing all the points. 
The following sketches show examples of the convex hull of four points. 


(Sketches: convex hulls of four points.)


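It is often convenient to represent the Bezier curve in matrix form. For Bezier cubics,
this is

$$P(u) = [u^3, u^2, u, 1]
\begin{bmatrix} -1 & 3 & -3 & 1 \\ 3 & -6 & 3 & 0 \\ -3 & 3 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} p_0 \\ p_1 \\ p_2 \\ p_3 \end{bmatrix} = u^T M_B\, p.$$
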


We now discuss B-splines. These curves are like Bezier curves in that they do not 
ordinarily pass through the given data points. They can be of any degree, but we will 
concentrate on the cubic form. Cubic B-splines resemble the ordinary cubic splines of 
the previous section, in that a separate cubic is derived for each pair of points in the set. 
However, the B-spline need not pass through any of the set that are used in its definition. 

We begin the description by stating the formula for a cubic B-spline in terms of 
parametric equations whose parameter is u: 

Given the points $p_i = (x_i, y_i)$, $i = 0, 1, \ldots, n$, the cubic B-spline for the interval
$(p_i, p_{i+1})$, $i = 1, 2, \ldots, n - 1$, is

$$B_i(u) = \sum_{k=-1}^{2} b_k\, p_{i+k}, \qquad \text{where}$$

$$\begin{aligned}
b_{-1} &= (1 - u)^3/6, \\
b_0 &= u^3/2 - u^2 + 2/3, \\
b_1 &= -u^3/2 + u^2/2 + u/2 + 1/6, \\
b_2 &= u^3/6, \qquad 0 \le u \le 1.
\end{aligned} \tag{3.28}$$



As before, $p_i$ refers to the point $(x_i, y_i)$; it is a two-component vector. The coefficients,
the $b_k$'s, serve as a basis and do not change as we move from one pair of points to the
next. Observe that they can be considered weighting factors applied to the coordinates
of a set of four points. The weighted sum, as u varies from 0 to 1, generates the B-spline
curve.

If we write out the equations for x and y from Eq. (3.28), we get 


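$$x_i(u) = \tfrac{1}{6}(1 - u)^3 x_{i-1} + \tfrac{1}{6}(3u^3 - 6u^2 + 4) x_i + \tfrac{1}{6}(-3u^3 + 3u^2 + 3u + 1) x_{i+1} + \tfrac{1}{6} u^3 x_{i+2},$$

$$y_i(u) = \tfrac{1}{6}(1 - u)^3 y_{i-1} + \tfrac{1}{6}(3u^3 - 6u^2 + 4) y_i + \tfrac{1}{6}(-3u^3 + 3u^2 + 3u + 1) y_{i+1} + \tfrac{1}{6} u^3 y_{i+2}.$$

Note the notation here: $x_i(u)$ is a function (of u) and $x_i$, $y_i$ are the components of the
point $p_i$.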

As we have said, the u-cubics act as weighting factors on the coordinates of the four 
successive points to generate the curve. For example, at u = 0, the weights applied are 
1/6, 2/3, 1/6, and 0. At u = 1, they are 0, 1/6, 2/3, and 1/6. These values vary
throughout the interval from u = 0 to u = 1. As an exercise, you are asked to graph
these factors. This will give you a visual impression of how the weights change with u.
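
Before graphing, the weights of Eq. (3.28) can be tabulated with a short Python sketch. The function name `bspline_weights` is an assumption made for this illustration. The last column prints the sum of the four weights, which works out to 1 for every u.

```python
def bspline_weights(u):
    """Return (b_-1, b_0, b_1, b_2) of Eq. (3.28) at parameter u."""
    b_m1 = (1.0 - u) ** 3 / 6.0
    b_0 = u ** 3 / 2.0 - u ** 2 + 2.0 / 3.0
    b_1 = -u ** 3 / 2.0 + u ** 2 / 2.0 + u / 2.0 + 1.0 / 6.0
    b_2 = u ** 3 / 6.0
    return b_m1, b_0, b_1, b_2

# Tabulate the weights at a few parameter values between 0 and 1.
for k in range(5):
    u = k / 4
    w = bspline_weights(u)
    cells = "  ".join(f"{v:.4f}" for v in w)
    print(f"u = {u:4.2f}   {cells}   sum = {sum(w):.4f}")
```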

Let us now examine two B-splines determined from a set of exactly four points. 
Figures 3.6(a) and 3.6(b) show the effect of varying just one of the points. As you would 
expect, when $p_1$ is moved upward and to the left, the curve tends to follow; in fact, it
is pulled to the opposite side of $p_2$. You may be surprised to see that the curve is never
very close to the two intermediate points, though it begins and ends at positions somewhat 
adjacent. It will be helpful to think of the curve generated from the defining equation for
$B_1$ as associated with a curve that goes from near $p_1$ to near $p_2$. It is also helpful to remember
that points $p_0$, $p_1$, $p_2$, and $p_3$ are used to get $B_1$.

Since a set of four points is required to generate only a portion of the B-spline, that 
associated with the two inner points, we must consider how to get the B-spline for more 
than four points as well as how to extend the curve into the region outside of the middle 
pair. We use a method analogous to the cubic splines of Section 3.9, marching along 
one point at a time, forming new sets of four. We abandon the first of the old set when 
we add the new one. 

The conditions that we want to impose on the B-spline are exactly the same as for 
ordinary splines—continuity of the curve and its first and second derivatives. It turns out 
that the equations for the weighting factors (the u-polynomials, the $b_k$) are such that these
requirements are met. Figure 3.7 shows how three successive parts of a B-spline might 
look. (We will consider how to fill in the end portions in a moment.) 
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Figure 3.6(a) 


Figure 3.6(b) 



Figure 3.7 
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Successive B-splines are joined together 


We can summarize the properties of B-splines as


1. Like the cubic splines of Section 3.9, B-splines are pieced together so
they agree at their joints in three ways:

a) $B_i(1) = B_{i+1}(0) = (p_i + 4p_{i+1} + p_{i+2})/6$.

b) $B_i'(1) = B_{i+1}'(0) = (-p_i + p_{i+2})/2$.

c) $B_i''(1) = B_{i+1}''(0) = p_i - 2p_{i+1} + p_{i+2}$.

The subscripts here refer to the portions of the curve and the points in
Figure 3.7.


2. The portion of the curve determined by each group of four points is
within the convex hull of these points.
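
The three joint conditions can be checked numerically. The following Python sketch is an illustration only: the helper names `weights` and `segment` and the five sample points are hypothetical. It differentiates the basis cubics of Eq. (3.28) analytically and compares the end of one segment with the start of the next.

```python
def weights(u, order=0):
    """Basis cubics of Eq. (3.28), or their first/second derivatives in u."""
    if order == 0:
        return ((1 - u) ** 3 / 6, u ** 3 / 2 - u ** 2 + 2 / 3,
                -u ** 3 / 2 + u ** 2 / 2 + u / 2 + 1 / 6, u ** 3 / 6)
    if order == 1:
        return (-(1 - u) ** 2 / 2, 1.5 * u ** 2 - 2 * u,
                -1.5 * u ** 2 + u + 0.5, u ** 2 / 2)
    return (1 - u, 3 * u - 2, -3 * u + 1, u)

def segment(points, i, u, order=0):
    """B_i(u), or a derivative, built from p_{i-1}, p_i, p_{i+1}, p_{i+2}."""
    w = weights(u, order)
    four = points[i - 1:i + 3]
    return tuple(sum(wk * p[c] for wk, p in zip(w, four)) for c in range(2))

# Hypothetical points p_0 ... p_4, chosen only for illustration.
pts = [(0.0, 0.0), (1.0, 3.0), (2.5, 3.5), (4.0, 1.0), (5.0, 2.0)]
for order in (0, 1, 2):
    # End of B_1 (u = 1) against start of B_2 (u = 0).
    print(order, segment(pts, 1, 1.0, order), segment(pts, 2, 0.0, order))
```

For these points the two printed values on each line agree, which is what conditions (a) through (c) state for i = 1.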


Now we consider how to generate the ends of the joined B-spline. If we have points
from $p_0$ to $p_n$, we already can construct B-splines $B_1$ through $B_{n-2}$. We need $B_0$ and $B_{n-1}$.
Our problem is that, using the procedure already defined, we would need additional points 
outside the domain of the given points. We probably also want to tie down the curve in 
some way—having it start and end at the extreme points of the given set seems like a 
good idea. How can we do this? 

First, we can add more points without creating artificiality by making the added 
points coincide with the given extreme points. If we add not just a single fictitious point 
at each end of the set, but two at each end, we will find that the new curves not only 
join properly with the portions already made, but start and end at the extreme points as 
we wanted. (It looks like we have added two extra portions, but reflection shows these 
are degenerate, giving only a single point.) 

In summary, we add fictitious points $p_{-2}$, $p_{-1}$, $p_{n+1}$, and $p_{n+2}$, with the first two
identical with $p_0$ and the last two identical with $p_n$. (There are other methods to handle
the starting and ending segments of B-splines that we do not cover.)
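
A minimal Python sketch of this end treatment follows. It is an illustration under stated assumptions, not the book's program: the names `blend` and `bspline_curve` and the sample points are hypothetical. Each end point is duplicated twice, and every window of four consecutive points produces one segment.

```python
def weights(u):
    """Basis cubics of Eq. (3.28)."""
    return ((1 - u) ** 3 / 6, u ** 3 / 2 - u ** 2 + 2 / 3,
            -u ** 3 / 2 + u ** 2 / 2 + u / 2 + 1 / 6, u ** 3 / 6)

def blend(four, u):
    """Weighted sum of four consecutive (x, y) points at parameter u."""
    w = weights(u)
    return tuple(sum(wk * p[c] for wk, p in zip(w, four)) for c in range(2))

def bspline_curve(points, samples_per_segment=8):
    """Sample the joined B-spline after duplicating each end point twice."""
    padded = [points[0]] * 2 + list(points) + [points[-1]] * 2
    windows = [padded[s:s + 4] for s in range(len(padded) - 3)]
    curve = []
    for four in windows:
        for k in range(samples_per_segment):
            curve.append(blend(four, k / samples_per_segment))
    curve.append(blend(windows[-1], 1.0))  # close the last segment at u = 1
    return curve

# Hypothetical points, chosen only for illustration.
pts = [(0.0, 0.0), (1.0, 3.0), (2.5, 3.5), (4.0, 1.0), (5.0, 2.0)]
curve = bspline_curve(pts)
print(curve[0], curve[-1])
```

With the duplicated points, the first sample coincides with the first given point and the last sample with the last given point, up to rounding.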
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The matrix formulation for cubic B-splines is helpful. Here it is: 


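$$B_i(u) = \frac{1}{6} [u^3, u^2, u, 1]
\begin{bmatrix} -1 & 3 & -3 & 1 \\ 3 & -6 & 3 & 0 \\ -3 & 0 & 3 & 0 \\ 1 & 4 & 1 & 0 \end{bmatrix}
\begin{bmatrix} p_{i-1} \\ p_i \\ p_{i+1} \\ p_{i+2} \end{bmatrix} = u^T M_b\, p / 6. \tag{3.29}$$


This applies on the interval [0, 1] and for the points $(p_i, p_{i+1})$.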

We conclude this section by looking at several examples of B-splines. The five parts 
of Figure 3.8 show B-splines that are defined by the same set of points as the Bezier 
curves in Figure 3.5. There are significant differences. 

B-splines differ from Bezier curves in three ways: 


1. For a B-spline, the curve does not begin and end at the extreme points.

2. The slopes of the B-splines do not have any simple relationship to lines
drawn between the points.

3. The endpoints of the B-splines are in the vicinity of the two intermediate
given points, but neither the x- nor the y-coordinates of these endpoints
normally equal the coordinates of the intermediate points.


Figure 3.8(a) 


Figure 3.8(b) 


Figure 3.8(c) 
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Figure 3.8(d) 


Figure 3.8(e) 


3.11 POLYNOMIAL APPROXIMATION OF SURFACES


When a function z is a polynomial function* of two variables x and y, say of degree 3 
in x and of degree 2 in y, we would have 


$$\begin{aligned}
z = f(x, y) ={}& a_0 + a_1 x + a_2 y + a_3 x^2 + a_4 xy + a_5 y^2 + a_6 x^3 \\
&+ a_7 x^2 y + a_8 x y^2 + a_9 x^3 y + a_{10} x^2 y^2 + a_{11} x^3 y^2.
\end{aligned} \tag{3.30}$$


Such a function describes a surface; (x, y, z) is a point on it. The functional relation is 
seen to involve many terms. If we are concerned with four independent variables (three 
space dimensions plus time, say), even low-degree polynomials would be quite intractable. 
Except for special purposes, such as when we need an explicit representation, perhaps 
to permit ready differentiation at an arbitrary point, we can avoid such complications by 
handling each variable separately. We will treat only this case. 

Note the immediate simplification of Eq. (3.30) if we let y take on a constant value,
say y = c. Combining the y factors with the coefficients, we get


$$z\,|_{y=c} = b_0 + b_1 x + b_2 x^2 + b_3 x^3.$$


This will be our attack in interpolating at the point (a, b) in a table of two variables:
hold one variable constant, say $y = y_1$, and the table becomes a single-variable problem.
The above methods then apply to give $f(a, y_1)$. If we repeat this at various values of y,
$y = y_2, y_3, \ldots, y_n$, we will get a table with x constant at the value x = a and with y
varying. We then interpolate at y = b.


EXAMPLE   Estimate f(1.6, 0.33) from the values in Table 3.10. Use quadratic interpolation in the
x-direction and cubic interpolation for y. We select one of the variables to hold constant,
say x. (This choice is arbitrary, since we would get the same result, except for differences
due to round-off, if we had chosen to hold y constant.) We decide to interpolate for y
within the three rows of the table at x = 1.0, 1.5, and 2.0, since the desired value at
x = 1.6 is most nearly centered within this set. We choose y-values of 0.2, 0.3, 0.4,
and 0.5 so that y = 0.33 is centralized. The shading in Table 3.10 shows the region of
fit for our polynomial.


Table 3.10   Tabulation of a function of two variables z = f(x, y)


  x \ y     0.1      0.2      0.3      0.4      0.5      0.6
  0.5               0.428    0.687    0.942    1.190    1.431
  1.0     0.271    0.640    1.003    1.359    1.703    2.035
  1.5     0.447    0.990    1.524    2.045    2.549    3.031
  2.0     0.738    1.568    2.384    3.177    3.943    4.672
  2.5     1.216    2.520    3.800    5.044    6.241    7.379
  3.0     2.005    4.090    6.136    8.122   10.030   11.841


16,277 


*We approximate a nonpolynomial function by a polynomial that agrees with the function, just as we have 
done with a function of one variable 
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We may either use divided differences or derive the interpolated values using dif- 
ference tables. Let us use the latter method since the data are evenly spaced. 


             y       z        Δz       Δ²z      Δ³z
            0.2    0.640
                             0.363
            0.3    1.003              -0.007
 x = 1.0                     0.356              -0.005
            0.4    1.359              -0.012
                             0.344
            0.5    1.703

            0.2    0.990
                             0.534
            0.3    1.524              -0.013
 x = 1.5                     0.521              -0.004
            0.4    2.045              -0.017
                             0.504
            0.5    2.549

            0.2    1.568
                             0.816
            0.3    2.384              -0.023
 x = 2.0                     0.793              -0.004
            0.4    3.177              -0.027
                             0.766
            0.5    3.943


We need the subtables from y = 0.2 to y = 0.5 since, for a cubic interpolation, four 
points are required. Using any convenient formula, we arrive at the results: 


             x       z        Δz       Δ²z
            1.0    1.1108
                             0.5710
 y = 0.33   1.5    1.6818              0.3717
                             0.9427
            2.0    2.6245


In the last tabulation we carry one extra decimal to guard against round-off errors.
Interpolating again, we get z = 1.8406, which we report as z = 1.841.
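
The two-stage procedure of this example can be sketched in Python. This is an illustration under stated assumptions: the function name `newton_forward` is hypothetical, and the Newton-Gregory forward formula is used here as one "convenient formula" for evenly spaced data. Intermediate values are not rounded, so the final digits can differ slightly from the hand computation.

```python
def newton_forward(values, s):
    """Newton-Gregory forward polynomial through evenly spaced values,
    evaluated at s = (t - t0) / h."""
    diffs = list(values)
    result, coeff = 0.0, 1.0   # coeff holds s(s-1)...(s-k+1)/k!
    for k in range(len(values)):
        result += coeff * diffs[0]
        # Move to the next column of the difference table.
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
        coeff *= (s - k) / (k + 1)
    return result

# Entries of Table 3.10 in the region of fit: x = 1.0, 1.5, 2.0; y = 0.2 ... 0.5.
rows = {
    1.0: [0.640, 1.003, 1.359, 1.703],
    1.5: [0.990, 1.524, 2.045, 2.549],
    2.0: [1.568, 2.384, 3.177, 3.943],
}

# Stage 1: cubic interpolation in y at y = 0.33 within each row.
s_y = (0.33 - 0.2) / 0.1
column = [newton_forward(rows[x], s_y) for x in (1.0, 1.5, 2.0)]

# Stage 2: quadratic interpolation in x at x = 1.6 through the three results.
s_x = (1.6 - 1.0) / 0.5
z = newton_forward(column, s_x)
print(column, z)
```

The three stage-1 values agree with the subtable above to about four decimals, and the final value is close to the 1.841 reported in the text.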
The function tabulated in Table 3.10 is $f(x, y) = e^x \sin y + y - 0.1$, so the true
value is f(1.6, 0.33) = 1.8350. Our error of -0.006 occurs because quadratic interpolation
for x is inadequate in view of the large second difference. In retrospect, it would have 
been better to use quadratic interpolation for y, since the third differences of the y-subtables 
are small, and let x take on a third-degree relationship. 

It is instructive to observe which of the values in Table 3.10 entered into our com- 
putation. The shaded rectangle covers these values. This is the “region of fit” for the 
interpolating polynomial that we have used. The principle of choosing values so that the 
point at which the interpolating polynomial is used is centered in the region of fit obviously 
applies here in exact analogy to the one-way table situation. It also applies to tables of 
three and four variables in the same way. Of course, the labor of interpolating in such 
multidimensional cases soon becomes burdensome. 

A rectangular region of fit is not the only possibility. We may change the degree of 
interpolation as we subtabulate the different rows or columns. Intuitively, it would seem 
best to use higher-degree polynomials for the rows near the interpolating point, decreasing 
the degree as we get farther away. The coefficient of the error term, when this is done, 
will be found to be minimized thereby, though for multidimensional interpolating poly- 
nomials the error term is quite complex. The region of fit will be diamond-shaped when 
such tapered degree functions are used. 

We may adapt the Lagrangian form of interpolating polynomial to the multidimen- 
sional case also. It is perhaps easiest to employ a process similar to the above example. 
Holding one variable constant, we write a series of Lagrangian polynomials for inter- 
polation at the given value of the other variable, and then combine these values in a final 
Lagrange form. The net result is a Lagrangian polynomial in which the function factors 
are replaced by Lagrangian polynomials. The resulting expression for the above example 
would be 


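$$
\begin{aligned}
z ={}& \frac{(y - 0.3)(y - 0.4)(y - 0.5)}{(0.2 - 0.3)(0.2 - 0.4)(0.2 - 0.5)}
\left[ \frac{(x - 1.5)(x - 2.0)}{(1.0 - 1.5)(1.0 - 2.0)}(0.640) + \frac{(x - 1.0)(x - 2.0)}{(1.5 - 1.0)(1.5 - 2.0)}(0.990) + \frac{(x - 1.0)(x - 1.5)}{(2.0 - 1.0)(2.0 - 1.5)}(1.568) \right] \\
&+ \frac{(y - 0.2)(y - 0.4)(y - 0.5)}{(0.3 - 0.2)(0.3 - 0.4)(0.3 - 0.5)}
\left[ \frac{(x - 1.5)(x - 2.0)}{(1.0 - 1.5)(1.0 - 2.0)}(1.003) + \frac{(x - 1.0)(x - 2.0)}{(1.5 - 1.0)(1.5 - 2.0)}(1.524) + \frac{(x - 1.0)(x - 1.5)}{(2.0 - 1.0)(2.0 - 1.5)}(2.384) \right] \\
&+ \frac{(y - 0.2)(y - 0.3)(y - 0.5)}{(0.4 - 0.2)(0.4 - 0.3)(0.4 - 0.5)}
\left[ \frac{(x - 1.5)(x - 2.0)}{(1.0 - 1.5)(1.0 - 2.0)}(1.359) + \frac{(x - 1.0)(x - 2.0)}{(1.5 - 1.0)(1.5 - 2.0)}(2.045) + \frac{(x - 1.0)(x - 1.5)}{(2.0 - 1.0)(2.0 - 1.5)}(3.177) \right] \\
&+ \frac{(y - 0.2)(y - 0.3)(y - 0.4)}{(0.5 - 0.2)(0.5 - 0.3)(0.5 - 0.4)}
\left[ \frac{(x - 1.5)(x - 2.0)}{(1.0 - 1.5)(1.0 - 2.0)}(1.703) + \frac{(x - 1.0)(x - 2.0)}{(1.5 - 1.0)(1.5 - 2.0)}(2.549) + \frac{(x - 1.0)(x - 1.5)}{(2.0 - 1.0)(2.0 - 1.5)}(3.943) \right].
\end{aligned}
$$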
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The equation is easy to write, but its evaluation by hand is laborious. If one is writing 
a computer program for interpolation in such multivariate situations, the Lagrangian form 
is recommended. There is a special advantage in that equal spacing in the table is not
required. The Lagrangian form is also perhaps the most straightforward way to write out 
the polynomial as an explicit function. 
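
A short Python sketch of the Lagrangian form for two variables follows. The function names `lagrange_basis` and `lagrange_2d` are hypothetical, introduced only for this illustration. The double sum is the same expression as the one written out for the example, with the table entries as the function factors.

```python
def lagrange_basis(nodes, k, t):
    """k-th Lagrange basis polynomial for the given nodes, evaluated at t."""
    value = 1.0
    for m, node in enumerate(nodes):
        if m != k:
            value *= (t - node) / (nodes[k] - node)
    return value

def lagrange_2d(xs, ys, table, x, y):
    """Sum of L_i(x) * L_j(y) * table[i][j], where table[i][j] = f(xs[i], ys[j]).

    The nodes in xs and ys need not be evenly spaced.
    """
    return sum(lagrange_basis(xs, i, x) * lagrange_basis(ys, j, y) * table[i][j]
               for i in range(len(xs)) for j in range(len(ys)))

# Region of fit from Table 3.10.
xs = [1.0, 1.5, 2.0]
ys = [0.2, 0.3, 0.4, 0.5]
table = [
    [0.640, 1.003, 1.359, 1.703],
    [0.990, 1.524, 2.045, 2.549],
    [1.568, 2.384, 3.177, 3.943],
]
z = lagrange_2d(xs, ys, table, 1.6, 0.33)
print(z)
```

Because the interpolating polynomial through these twelve values is unique, this estimate matches the one found with difference tables, apart from rounding of intermediate values.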

When the given points are not evenly spaced, the above method using Lagrangian 
polynomials or the method of divided differences would be used for interpolation. With 
the latter, exactly the same principle is involved—hold one variable constant while 
subtables of divided differences are constructed, then combine the interpolated values 
from these subtables into a new table. 

Another alternative is to use cubic splines for interpolation in multivariate cases. 
There again it is perhaps best to hold one variable constant while constructing one-way 
splines, then combine the results from these in the second phase. The computational effort 
would be significant, however. 

Interpolating for values of functions of two independent variables can also be thought 
of as constructing a surface that is defined by the given points. Rather than finding values 
on a surface that contains the given points, we can construct surfaces that are analogous 
to Bezier curves and B-spline curves where the surface does not normally contain the 
given points. 

So far we have been able to interpolate simple surfaces where we are given z as
a function of x and y. Suppose now we are given a set of points, $p_i = (x_i, y_i, z_i)$,
$i = 0, \ldots, n$, and we wish to fit a surface to those points. This would be the case if we
were trying to draw a mountain, an airplane, or a teapot. But first we consider the representation
of more general surfaces. Let p = (x, y, z) be any point on the surface. Then the
coordinates of each point are represented as the equations


$$x = x(u, v), \qquad y = y(u, v), \qquad z = z(u, v),$$


where u, v are the independent variables that range over a given set of values, and x, y, 
z are the dependent variables. This is a slight change of notation from the first part of 
this section. 

An example of this would be the equations of a sphere of radius r about the origin,
(0, 0, 0). Here any point on the surface of the sphere is given by


$$x = r\cos(u)\sin(v), \qquad y = r\sin(u)\sin(v), \qquad z = r\cos(v),$$


where u ranges in value from 0 to $2\pi$, and v ranges from 0 to $\pi$. Figure 3.9 illustrates
this.
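
As a small check of this parametric representation, the Python sketch below (the function name `sphere_point` is an assumption for illustration) generates points for several values of u and v and confirms that each lies at distance r from the origin.

```python
import math

def sphere_point(r, u, v):
    """Point on the sphere of radius r for parameters u in [0, 2*pi], v in [0, pi]."""
    return (r * math.cos(u) * math.sin(v),
            r * math.sin(u) * math.sin(v),
            r * math.cos(v))

r = 2.0
for i in range(4):
    for j in range(1, 4):
        u = 2.0 * math.pi * i / 4
        v = math.pi * j / 4
        x, y, z = sphere_point(r, u, v)
        # Every generated point should satisfy x^2 + y^2 + z^2 = r^2.
        print(f"u = {u:5.3f}  v = {v:5.3f}  radius = {math.sqrt(x*x + y*y + z*z):.6f}")
```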


Figure 3.9 
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We will only describe constructing a B-spline surface. (A most interesting and inform- 
ative description of Bezier surfaces can be found in Crow (1987).) 

From the previous section, we know that a cubic B-spline curve segment running from
near the point $p_i$ to near the point $p_{i+1}$ is determined by the four points


$$p_{i-1}, \quad p_i, \quad p_{i+1}, \quad p_{i+2},$$


where $p_i(u) = (x_i(u), y_i(u))$ in two dimensions, or $p_i(u) = (x_i(u), y_i(u), z_i(u))$ if we had
been working in three dimensions. The segment was then extended by introducing $p_{i+3}$,
deleting $p_{i-1}$, and generating the line for $0 \le u \le 1$.


The process is continued until we have $B_{n-2}$. Finally, the first and last segments are
generated by starting with $p_0$, $p_0$, $p_0$, $p_1$ and ending with $p_{n-1}$, $p_n$, $p_n$, $p_n$.

In an analogous manner the interpolating B-spline surface patch depends on 16 points, 
as Figure 3.10 shows. 
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Figure 3.10 
Here $p_{i,j} = (x_{i,j}, y_{i,j}, z_{i,j})$, a point in $E^3$. This patch is generated by computing the points
$p_{i,j}(u, v)$, for $0 \le u \le 1$ and $0 \le v \le 1$. Here we have changed the subscripts on the
points $p_{i,j}$ so as to fit into matrix notation.

For simplicity, we will consider only the x-coordinate in detail. Comparable formulations
hold for the y- and z-coordinates. The simplest formulation for $x_{i,j}(u, v)$ is
based on the matrix formulation of Eq. (3.29) and is given by


$$x_{i,j}(u, v) = \frac{1}{36} [u^3, u^2, u, 1]\, M_b\, X_{i,j}\, M_b^T \begin{bmatrix} v^3 \\ v^2 \\ v \\ 1 \end{bmatrix}, \tag{3.31}$$


where $X_{i,j}$ is the 4 x 4 matrix


$$X_{i,j} = \begin{bmatrix}
x_{i-1,j-1} & x_{i-1,j} & x_{i-1,j+1} & x_{i-1,j+2} \\
x_{i,j-1} & x_{i,j} & x_{i,j+1} & x_{i,j+2} \\
x_{i+1,j-1} & x_{i+1,j} & x_{i+1,j+1} & x_{i+1,j+2} \\
x_{i+2,j-1} & x_{i+2,j} & x_{i+2,j+1} & x_{i+2,j+2}
\end{bmatrix},$$


which are just the x-coordinates of the 16 points of Figure 3.10. The matrix $M_b$ is the
matrix we saw before in Eq. (3.29).


The y and z equations are then obtained by merely substituting the corresponding matrices
$Y_{i,j}$ and $Z_{i,j}$, which are formed from the y and z components of the 16 points. Since each
of these equations is cubic in u and v, they are referred to as bicubic equations. The
coordinates of the points on a patch are given by

$$\begin{aligned}
x(u, v) &= \tfrac{1}{36} [u^3, u^2, u, 1]\, M_b\, X_{i,j}\, M_b^T\, [v^3, v^2, v, 1]^T, \\
y(u, v) &= \tfrac{1}{36} [u^3, u^2, u, 1]\, M_b\, Y_{i,j}\, M_b^T\, [v^3, v^2, v, 1]^T, \\
z(u, v) &= \tfrac{1}{36} [u^3, u^2, u, 1]\, M_b\, Z_{i,j}\, M_b^T\, [v^3, v^2, v, 1]^T,
\end{aligned}$$

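as u and v range between 0 and 1. It is easily verified that the weights applied to each
of the 16 points are


$$\frac{1}{36}\begin{bmatrix} 1 & 4 & 1 & 0 \\ 4 & 16 & 4 & 0 \\ 1 & 4 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \text{ at } p_{i,j}(u, v) \text{ (for } u = 0,\ v = 0\text{), and}$$

$$\frac{1}{36}\begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 4 & 1 \\ 0 & 4 & 16 & 4 \\ 0 & 1 & 4 & 1 \end{bmatrix} \text{ at } p_{i,j}(u, v) \text{ (for } u = 1,\ v = 1\text{)},$$


where each (i, j)th element is the coefficient for the corresponding point in Figure 3.10.
In effect, these matrices are templates that overlay the points shown in Figure 3.10.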
The surface patch is extended by adding another row or column of points and deleting 
a corresponding row or column of points. One should verify that the current and previous 
patches are connected smoothly along the edge where they join. An initial or final patch 
can be obtained by repeating a corner, as was suggested for the B-spline curve. This will 
ensure that the patch actually starts or ends at a point. For the surface we would repeat 
a point nine times instead of three times as was done for the curve. 

For a more detailed and informative discussion of interpolating curves and surfaces, 
the reader should consult Pokorny and Gerald (1989) 


CHAPTER SUMMARY 


If you have understood this chapter on interpolation, you are now able to 


1. Construct interpolating polynomials for any set of data by (a) use of the Lagrangian 
formulation or (b) use of divided differences. You can explain why the divided-difference
technique is preferred. You can use either to interpolate from a set of
given data pairs. 

2. Write and use an expression that computes the error of an approximation. You know 
the difficulty and limitations of such error estimation. 


3. Use a table of differences of evenly spaced data to construct a variety of interpolation 
polynomials (Newton—Gregory forward and backward, Gauss forward and back- 
ward, Bessel, and Stirling). You understand that these are merely different ways to 
create the same polynomial, but you also can select the proper one in situations 
where one is preferred. 
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Figure 3.11 
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PROGRAM DIVDIF (INPUT, OUTPUT) 


THIS PROGRAM USES THE DIVIDED DIFFERENCES METHOD TO 
INTERPOLATING POLYNOMIAL THAT GOES THROUGH A GIVEN S: 
POINTS. 


SUBROUTINE DOCOEF: FINDS THE COEFFICIENTS OF 
INTERPOLATING POLYNOMIAL 
FUNCTION EVAL: EVALUATES THE DIVIDED DIFFE! 
AT A GIVEN VALUE U. 


REAL X(5), 
INTEGER I, J, N 


DATA N/5/ 
DATA X/3.2,2.7,1. 
DATA 


CALL DOCOEFF (X,Y, CO! 


PRINT ’ (///)’ 
i, € 


* (SF9.5)", 
al 2,9 fa 


INPUT: X,Y - ATA POINTS 
N - MBER OF DATA POINTS 
OUTPUT: COEFFS Ee Cc CIENTS OF P(X) 


      INTEGER N, I,J,K
      REAL X(N),Y(N),COEFFS(N)


      DO 10 I = 1,N
         COEFFS(I) = Y(I)
   10 CONTINUE


Program 1. 


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Figure 3.11 (continued) 


      DO 20 J = 2,N
         TEMP1 = COEFFS(J-1)
         DO 30 K = J,N
            TEMP2 = COEFFS(K)
            COEFFS(K) = (COEFFS(K) - TEMP1)/(X(K) - X(K-J+1))
            TEMP1 = TEMP2
   30    CONTINUE
   20 CONTINUE
      RETURN
      END


REAL FUNCTION EVAL(U, COEFFS, X, N) 


INPUT: U - THE X-VALUE 
OUTPUT: EVAL - THE CORRESPONDING Y-VALUE, I.E. P(U) 


      REAL U, COEFFS(N), X(N), SUM
      INTEGER N, I


      SUM = 0.0


      DO 10 I = N,2,-1
         SUM = (SUM + COEFFS(I)) * (U - X(I-1))
   10 CONTINUE


SUM = SUM + COEFFS(1) 
EVAL = SUM 

RETURN 

END 


OUTPUT FOR PROGRAM 1


THE COEFFICIENTS FOR THE POLYNOMIAL ARE:


 22.00000  8.40000  2.85561  -.52748   .25584


     U        P(U)


   1.000    14.20000
   1.200    12.89775
   1.400    12.25424
   1.600    12.16434
   1.800    12.53272
   2.000    13.27390
   2.200    14.31221
   2.400    15.58181
   2.600    17.02667
   2.800    18.60059
   3.000    20.26722
   3.200    22.00000
   3.400    23.78221
   3.600    25.60694
   3.800    27.47713
   4.000    29.40553
   4.200    31.41470
   4.400    33.53706
   4.600    35.81481
   4.800    38.30000
   5.000    41.05451
   5.200    44.15003
   5.400    47.66808
   5.600    51.70000


      SUBROUTINE INTERP (X1,XN,H,N,Y,X,M,YOUT)
C
C     SUBROUTINE INTERP
C
C     SUBROUTINE TO INTERPOLATE IN A TABLE OF
C     UNIFORMLY SPACED VALUES.
C
C     PARAMETERS ARE :
C
C     X1,XN - BEGINNING AND ENDING X VALUES
C     H     - DELTA X - THE UNIFORM SPACING
C     N     - NUMBER OF ENTRIES IN THE TABLE
C     Y     - ARRAY OF FUNCTION VALUES
C     X     - VALUE AT WHICH Y IS TO BE INTERPOLATED
C     M     - DEGREE OF INTERPOLATING POLYNOMIAL. THE SUBROUTINE WILL
C             HANDLE UP TO 10TH DEGREE, BUT USUALLY THE DEGREE WILL BE
C             LESS THAN 10 TO AVOID ROUND OFF ERRORS.
C     YOUT  - THE INTERPOLATED Y VALUE RETURNED TO THE CALLER
C
C     AN ARRAY D IS USED IN THE SUBROUTINE TO HOLD DELTA Y'S.
C
      REAL X1,XN,H,Y(N),X,YOUT
      INTEGER N,M,I,J,K
      REAL D(10),FM,FJ,X0,Y0,S,FNUM,DEN,FI
C
C     FIRST FIND PROPER SUBSCRIPT FOR Y0 SO THAT X IS CENTERED IN THE
C     DOMAIN AS WELL AS POSSIBLE. THIS SUBSCRIPT VALUE IS CALLED J.
C

Figure 3.12 Program 2. 
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Figure 3.13 (continued) 


      REAL X(20),Y(20),XINT,YOUT,DELX,ERROR,DIFF,A,B,F,H,COEFFS(20)
      INTEGER N,K,J
      DATA A,B / -1.0,1.0 /
C
C     INITIALIZE VARIABLES AND STORE INFORMATION IN X AND Y ARRAYS
C
      DELX = (B - A) / 64.0
      PRINT 100
      DO 20 K = 11,19,2
         H = (B - A) / (K - 1)
         DO 10 J = 1,K
            X(J) = -1.0 + (J - 1)*H
            Y(J) = F(X(J))
   10    CONTINUE
         ERROR = 0.0
C
         CALL DOCOEFF (X,Y,COEFFS,K)
C
         DO 15 XINT = -1.0,1.0,DELX
            YOUT = EVAL (XINT,COEFFS,X,K)
            ERROR = MAX( ABS(F(XINT) - YOUT), ERROR )
   15    CONTINUE
         PRINT 200, K-1,ERROR
   20 CONTINUE
      PRINT 300
C
  100 FORMAT (//T11,'DEGREE',T22,'MAX ERROR FOUND' /)
  200 FORMAT (T12,I3,T24,F11.7)
  300 FORMAT (///)
      STOP
      END
C
      REAL FUNCTION F(T)
      REAL T
      F = 1.0 / (1.0 + 25.0*T*T)
      RETURN
      END

OUTPUT FOR PROGRAM 3 Wy 


DEGREE MAX ERROR FOUND 


10 
12 


1.9132554 
3.4710393 
14 7.0075584 
4. 
Di 


16 
18 


3699041 
0364012 
The fourth program (Fig. 3.14) implements the cubic spline method on the function
$y = x^3 - 8$ (Example 1 in Section 3.9). All four conditions, called IEND, are tested on
this function. The program has two subprograms, CUBSPL and SEVAL. SUBROUTINE 
CUBSPL uses the data points, x,y-pairs to generate the matrix appropriate to the condition, 
IEND = 1 through 4. The matrix is in tridiagonal form. The subroutine then produces 
the cubic polynomial coefficients for each subinterval. FUNCTION SEVAL returns the 
appropriate value for a given x = u. SEVAL finds the interval containing u, and uses the 
polynomial to compute the spline value. For this example we get as output: the matrix 
in tridiagonal form, the coefficients of the cubic polynomials, the second derivatives, and 
the spline values compared to the exact function values. 


PROGRAM CUBIC (INPUT, OUTPUT) 


THIS PROGRAM FITS DATA WITH A CUBIC INTERPOLATING 
SPLINE. THERE ARE FOUR KINDS OF SPLINES DEPENDING 
ON IEND = 1,2,3,OR 4. 
IEND = 1 : LINEAR END CONDITION: S(0) = S(N) 
IEND = 2 : PARABOLIC END CONDITION: S(0)=S(1), 

S (N-1) =S (N) 
IEND = 3. : CUBIC END CONDITION 
IEND = 4 ; FIRST DERIVATIVE IS KNOWN AT X(0) AND 


X(N). THE VALUES ARE STORED IN S(0) AND 
S(N) RESPECTIVELY. 


THIS EXAMPLE FITS THE APPROPRIATE CUBIC SPLINE 
FOR THE FOUR POINTS ON THE CURVE DEFINED BY 
THE FUNCTION: F(X) = X*X*X - 8 




      INTEGER N,I,IEND
      REAL X(100),Y(100),A(100),B(100),C(100),S(100)
      DATA N, IEND/5,1/
      DATA X/0,1,2,3,4,95*0.00000000/
      DATA Y/-8,-7,0,19,56,95*0.0/
      DATA S/100*0.0/
      READ *, IEND
      IF (IEND .EQ. 4) THEN
         S(1) = 0.0
         S(N) = 48.000000
      END IF


      PRINT '(///)'
      PRINT *, ' OUTPUT FOR IEND = ', IEND
      PRINT '(//)'


CALL CUBSPL(X,¥,IEND,N,A,B,C,S) 


PRINT 98 
PRINT 99 
PRINT 101, (S(I), I=1,N) 
PRINT 98 
PRINT 200 
DO 31 U = 0.0,4.01,0.25 


Figure 3.14 Program 4. 
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Figure 3.14 (continued) 


         TEMP1 = SEVAL(N,U,X,Y,A,B,C)
         TEMP2 = FCN(U)
         PRINT 100, U,TEMP1,TEMP2, TEMP1 - TEMP2
   31 CONTINUE
      PRINT 98
   98 FORMAT (//T2,60('*')/)
   99 FORMAT (' THE SECOND DERIVATIVES S(I) ARE: '/)
  100 FORMAT (3X,F4.2,4X,2F10.4,4X,F10.5)
  101 FORMAT (10F9.4)
  200 FORMAT (T5,' X',T15,' SPLINE (X)',T26,' F(X)',T41,'DIFF'//)
      STOP
      END


SUBROUTINE CUBSPL (X,Y, IEND,N,A,B,C,S) 


SUBROUTINE CUBSPL : 

THIS ROUTINE COMPUTES THE MATRIX 
FOR FINDING THE COEFFICIENTS OF A CUBIC SPLINE THROUGH 
A SET OF DATA. THE SYSTEM THEN IS SOLVED TO OBTAIN THE 
SECOND DERIVATIVE VALUES AND THE COEFFICIENTS OF THE CUBIC 
SPLINE POLYNOMIALS. 


PARAMETERS ARE : 


X,Y : ARRAYS OF X AND Y VALUES TO BE FITTED 

NUMBER OF POINTS 

TYPE OF END CONDITION TO BE USED 

IEND = 1, LINEAR ENDS, S(1) = S(N) = 0. 

IEND = 2, PARABOLIC ENDS, S(1) = S(2), S(N) = S(N-1) 

IEND = 3, CUBIC ENDS, S(1), S(N), ARE EXTRAPOLATED 

IEND = 4, THE FIRST DERIVATIVES ARE GIVEN AT EITHER E 
FPRIME(X1) IS STORED IN S(1) ON INPUT, AND 
IS STORED IN S(N). 

SMATRIX : AUGMENTED MATRIX OF COEFFICIENTS AND R.H.S. FOR 

FINDING S. 
A,B,C : ARRAYS OF SPLINE COEFFICIENTS 


      REAL X(N),Y(N),S(N),A(N),B(N),C(N),SMATRIX(0:20,4)
     +     ,DX1,DY1,DX2,DY2,DXN1,DXN2
      REAL H(20)
      INTEGER N,IEND,NM1,NM2,I,J,FIRST,LAST
C
C     COMPUTE FOR THE N-2 ROWS
C
      NM2 = N - 2
      NM1 = N - 1
      DX1 = X(2) - X(1)
      DY1 = ( Y(2) - Y(1) ) / DX1 * 6.0
C
      DO 10 I = 1,NM2
         DX2 = X(I + 2) - X(I + 1)
         DY2 = ( Y(I + 2) - Y(I + 1) ) / DX2 * 6.0
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Figure 3.14 (continued) 


         SMATRIX(I,1) = DX1
         SMATRIX(I,2) = 2.0 * (DX1 + DX2)
         SMATRIX(I,3) = DX2
         SMATRIX(I,4) = DY2 - DY1
         DX1 = DX2
         DY1 = DY2
   10 CONTINUE
      FIRST = 2
      LAST = NM2
c 
c 
c 
C ADJUST FIRST AND LAST ROWS APPROPRIATE TO END CONDITION. 
c 
c 
C FOR IEND = 1, NO CHANGE IS NEEDED. 
c 
      IF ( IEND .EQ. 2 ) THEN
C
C     FOR IEND = 2, S(1) = S(2), S(N) = S(N-1), PARABOLIC ENDS.
C
   50    SMATRIX(1,2) = SMATRIX(1,2) + X(2) - X(1)
         SMATRIX(NM2,2) = SMATRIX(NM2,2) + X(N) - X(NM1)
C
      ELSE IF ( IEND .EQ. 3 ) THEN
C
C     FOR IEND = 3, CUBIC ENDS, S(1), S(N) ARE EXTRAPOLATED.
C
   80    DX1 = X(2) - X(1)
         DX2 = X(3) - X(2)
         SMATRIX(1,2) = (DX1 + DX2) * (DX1 + 2.0 * DX2) / DX2
         SMATRIX(1,3) = (DX2 * DX2 - DX1 * DX1) / DX2
         DXN2 = X(NM1) - X(NM2)
         DXN1 = X(N) - X(NM1)
         SMATRIX(NM2,1) = (DXN2 * DXN2 - DXN1 * DXN1) / DXN2
         SMATRIX(NM2,2) = (DXN1 + DXN2) * (DXN1 + 2.0 * DXN2) / DXN2
C
      END IF
      IF ( IEND .EQ. 4 ) THEN
         DX1 = X(2) - X(1)
         DY1 = (Y(2) - Y(1)) / DX1
         SMATRIX(0,1) = 1.0
         SMATRIX(0,2) = 2.0*DX1
         SMATRIX(0,3) = DX1
         SMATRIX(0,4) = (DY1 - S(1))*6
         DX1 = X(N) - X(N-1)
         DY1 = (Y(N) - Y(N-1)) / DX1
         SMATRIX(NM1,1) = DX1
         SMATRIX(NM1,2) = 2.0*DX1
         SMATRIX(NM1,3) = 0.0
         SMATRIX(NM1,4) = (S(N) - DY1)*6.0
         FIRST = 1
         LAST = N-1
      END IF
c 
c PRINT OUT TRIDIAGONAL MATRIX IN COMPACT FORM 
Cc 
Cc 
DO 11 I = FIRST-1,LAST 
11 PRINT *, (SMATRIX(I,J), J = 1,4) 
C
C     NOW WE SOLVE THE TRIDIAGONAL SYSTEM. FIRST REDUCE.
C
DO 110 I = FIRST,LAST 


SMATRIX(I,1) = SMATRIX(I,1) /SMATRIX(I-1,2) 
SMATRIX(I,2) = SMATRIX(I,2) - SMATRIX(I,1) * SMATRIX(I-1,3) 
SMATRIX(I,4) = SMATRIX(I,4) - SMATRIX(I,1) * SMATRIX(I-1,4) 
110 CONTINUE 
Cc 
C     ------------------------------------------------------------
C
C NOW WE BACK SUBSTITUTE 
c 
SMATRIX (LAST, 4) = SMATRIX(LAST,4) / SMATRIX (LAST, 2) 
DO 120 J = LAST-1,FIRST-1,-1 
SMATRIX(J,4) = ( SMATRIX(J,4) - SMATRIX(J,3) * SMATRIX(J+1,4) ) 
+ / SMATRIX(J,2) 


120 CONTINUE 


C     NOW PUT THE VALUES INTO THE S VECTOR
C


DO 130 I = FIRST-1,LAST 
S(I+1) = SMATRIX(I,4) 
130 CONTINUE 


C     ------------------------------------------------------------
C
C     GET S(1) AND S(N) ACCORDING TO END CONDITIONS
c 
IF (IEND .EQ. 1) THEN 
c 
C     ------------------------------------------------------------
Cc 
C FOR LINEAR ENDS, S(1) = 0, S(N) = 0. 
c 
S(1) = 0.0 
S(N) = 0.0 
ELSE IF (IEND .EQ. 2) THEN 
C
C     ------------------------------------------------------------
C
C     FOR PARABOLIC ENDS, S(1) = S(2), S(N) = S(N-1).
C
        S(1) = S(2)
        S(N) = S(N-1)
C
C     ------------------------------------------------------------
C
C     FOR CUBIC ENDS, EXTRAPOLATE TO GET S(1) AND S(N).
C
      ELSE IF (IEND .EQ. 3) THEN
        S(1) = ((DX1 + DX2) * S(2) - DX1 * S(3)) / DX2
        S(N) = ((DXN2 + DXN1) * S(NM1) - DXN1 * S(NM2)) / DXN2
END IF 
Cc 
Cc WRITE OUT COEFFICIENTS OF THE POLYNOMIALS 
c 
PRINT 99 
   99 FORMAT (/T2,60('*')//T10,'THE CUBIC POLYNOMIALS, G(X)'
     +       ,' DEFINED ON THE INTERVALS' //)
      PRINT 101
      DO 200 I = 1,N-1
        H(I) = X(I+1) - X(I)
        A(I) = (S(I+1) - S(I)) / (6 * H(I))
        B(I) = S(I) / 2
        C(I) = ((Y(I+1) - Y(I)) / H(I)) - ((2 * H(I) * S(I) +
     +         H(I)*S(I+1)) / 6)
        PRINT 102,I, A(I),B(I),C(I),Y(I)
  101   FORMAT (T5,'I',T17,'A',T27,'B',T37,'C',T47,'D' /)
  102   FORMAT (T5,I1,T12,F8.4,T22,F8.4,T32,F8.4,T42,F8.4)
  200 CONTINUE

RETURN 

END 


REAL FUNCTION SEVAL(N,U,X,Y,A,B,C) 
INTEGER N 
REAL U,X(N),Y(N),B(N),C(N),A(N) 


C     THIS SUBROUTINE EVALUATES THE CUBIC SPLINE FUNCTION
C     SEVAL = Y(I) + C(I)*(U-X(I)) + B(I)*(U-X(I))**2 + A(I)*(U-X(I))**3
C     WHERE X(I).LT.U.LT.X(I+1), USING HORNER'S RULE
C
C     IF U.LT.X(1) THEN I=1 IS USED
C     ELSE I=N IS USED
C
C     N = THE NUMBER OF DATA POINTS
C     U = THE ABSCISSA AT WHICH THE SPLINE IS TO BE EVALUATED
C     X,Y = THE ARRAYS OF DATA ABSCISSAS AND ORDINATES
C     B,C,A = THE ARRAYS OF SPLINE COEFFICIENTS COMPUTED BY SPLINE
C
C     IF U IS NOT IN THE SAME INTERVAL AS THE PREVIOUS CALL, THEN A
C     SEARCH IS PERFORMED TO DETERMINE THE PROPER INTERVAL.
C


INTEGER I,J,K 
REAL DX 
SAVE I 
      DATA I/1/
      IF(I.GE.N) THEN
        I=1
      ENDIF
      IF(U.GE.X(N)) THEN
        DX=U-X(N-1)
        SEVAL=Y(N-1)+DX*(C(N-1)+DX*(B(N-1)+DX*A(N-1)))
        RETURN
      END IF
      IF(U.GE.X(I)) THEN
        IF(U.LE.X(I+1)) THEN
          DX=U-X(I)
          SEVAL=Y(I)+DX*(C(I)+DX*(B(I)+DX*A(I)))
          RETURN
        ENDIF
      ENDIF
C     BINARY SEARCH FOR THE INTERVAL CONTAINING U
      I=1
      J=N+1
   10 K=(I+J)/2
      IF(U.LT.X(K)) THEN
        J=K
      ELSE
        I=K
      ENDIF
      IF(J.GT.I+1) THEN
        GOTO 10
      ELSE
C     EVALUATE SPLINE
        DX=U-X(I)
        SEVAL=Y(I)+DX*(C(I)+DX*(B(I)+DX*A(I)))
        RETURN
      ENDIF
      END


REAL FUNCTION FCN(X) 
REAL EXP,X 

FCN = X**3 - 8.0 
RETURN 

END 


OUTPUT FOR PROGRAM 4


OUTPUT FOR IEND = 1


    1.    4.    1.    36.
    1.    4.    1.    72.
    1.    4.    1.   108.


 ************************************************************


         THE CUBIC POLYNOMIALS, G(X) DEFINED ON THE INTERVALS


    I        A          B          C          D

    1      1.0714      .0000     -.0714    -8.0000
    2       .6429     3.2143     3.1429    -7.0000
    3      2.3571     5.1429    11.5000      .0000
    4     -4.0714    12.2143    28.8571    19.0000


 ************************************************************


 THE SECOND DERIVATIVES S(I) ARE;


      .0000    6.4286   10.2857   24.4286     .0000


 ************************************************************

      X      SPLINE(X)      F(X)        DIFF
     .00     -8.0000      -8.0000      .00000
     .25     -8.0011      -7.9844     -.01674
     .50     -7.9018      -7.8750     -.02679
     .75     -7.6016      -7.5781     -.02344
    1.00     -7.0000      -7.0000      .00000
    1.25     -6.0033      -6.0469      .04353
    1.50     -4.5446      -4.6250      .08036
    1.75     -2.5636      -2.6406      .07701
    2.00       .0000        .0000      .00000
    2.25      3.2333       3.3906     -.15737
    2.50      7.3304       7.6250     -.29464
    2.75     12.5123      12.7969     -.28460
    3.00     19.0000      19.0000      .00000
    3.25     26.9141      26.3281      .58594
    3.50     35.9732      34.8750     1.09821
    3.75     45.7958      44.7344     1.06138
    4.00     56.0000      56.0000      .00000


 ************************************************************


OUTPUT FOR IEND = 2


    1.    5.    1.    36.
    1.    4.    1.    72.
    1.    5.    1.   108.


 ************************************************************


         THE CUBIC POLYNOMIALS, G(X) DEFINED ON THE INTERVALS


    I        A          B          C          D

    1       .0000     2.4000    -1.4000    -8.0000
    2      1.2000     2.4000     3.4000    -7.0000
    3      1.2000     6.0000    11.8000      .0000
    4       .0000     9.6000    27.4000    19.0000


 ************************************************************


 THE SECOND DERIVATIVES S(I) ARE;

     4.8000    4.8000   12.0000   19.2000   19.2000


 ************************************************************

      X      SPLINE(X)      F(X)        DIFF
     .00     -8.0000      -8.0000      .00000
     .25     -8.2000      -7.9844     -.21563
     .50     -8.1000      -7.8750     -.22500
     .75     -7.7000      -7.5781     -.12188
    1.00     -7.0000      -7.0000      .00000
    1.25     -5.9813      -6.0469      .06563
    1.50     -4.5500      -4.6250      .07500
    1.75     -2.5938      -2.6406      .04688
    2.00       .0000        .0000      .00000
    2.25      3.3438       3.3906     -.04688
    2.50      7.5500       7.6250     -.07500
    2.75     12.7313      12.7969     -.06563
    3.00     19.0000      19.0000      .00000
    3.25     26.4500      26.3281      .12188
    3.50     35.1000      34.8750      .22500
    3.75     44.9500      44.7344      .21563
    4.00     56.0000      56.0000      .00000


 ************************************************************


OUTPUT FOR IEND = 3


    1.    6.    0.    36.
    1.    4.    1.    72.
    0.    6.    1.   108.

 ************************************************************


         THE CUBIC POLYNOMIALS, G(X) DEFINED ON THE INTERVALS


    I        A          B          C          D

    1      1.0000      .0000      .0000    -8.0000
    2      1.0000     3.0000     3.0000    -7.0000
    3      1.0000     6.0000    12.0000      .0000
    4      1.0000     9.0000    27.0000    19.0000

 ************************************************************

 THE SECOND DERIVATIVES S(I) ARE;

      .0000    6.0000   12.0000   18.0000   24.0000


 ************************************************************

      X      SPLINE(X)      F(X)        DIFF
     .00     -8.0000      -8.0000      .00000
     .25     -7.9844      -7.9844      .00000
     .50     -7.8750      -7.8750      .00000
     .75     -7.5781      -7.5781      .00000
    1.00     -7.0000      -7.0000      .00000
    1.25     -6.0469      -6.0469      .00000
    1.50     -4.6250      -4.6250      .00000
    1.75     -2.6406      -2.6406      .00000
    2.00       .0000        .0000      .00000
    2.25      3.3906       3.3906      .00000
    2.50      7.6250       7.6250      .00000
    2.75     12.7969      12.7969      .00000
    3.00     19.0000      19.0000      .00000
    3.25     26.3281      26.3281      .00000
    3.50     34.8750      34.8750      .00000
    3.75     44.7344      44.7344      .00000
    4.00     56.0000      56.0000      .00000


OUTPUT FOR IEND = 4


 [rows of the tridiagonal matrix are not legible in the source]


 ************************************************************


         THE CUBIC POLYNOMIALS, G(X) DEFINED ON THE INTERVALS


    I        A          B          C          D

    1      1.0000      .0000      .0000    -8.0000
    2      1.0000     3.0000     3.0000    -7.0000
    3      1.0000     6.0000    12.0000      .0000
    4      1.0000     9.0000    27.0000    19.0000


 ************************************************************


 THE SECOND DERIVATIVES S(I) ARE;


      .0000    6.0000   12.0000   18.0000   24.0000


 ************************************************************

      X      SPLINE(X)      F(X)        DIFF
     .00     -8.0000      -8.0000      .00000
     .25     -7.9844      -7.9844      .00000
     .50     -7.8750      -7.8750      .00000
     .75     -7.5781      -7.5781      .00000
    1.00     -7.0000      -7.0000      .00000
    1.25     -6.0469      -6.0469      .00000
    1.50     -4.6250      -4.6250      .00000
    1.75     -2.6406      -2.6406      .00000
    2.00       .0000        .0000      .00000
    2.25      3.3906       3.3906      .00000
    2.50      7.6250       7.6250      .00000
    2.75     12.7969      12.7969      .00000
    3.00     19.0000      19.0000      .00000
    3.25     26.3281      26.3281      .00000
    3.50     34.8750      34.8750      .00000
    3.75     44.7344      44.7344      .00000
    4.00     56.0000      56.0000      .00000


 ************************************************************
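The linear-ends case (IEND = 1) of Program 4 can be cross-checked with a short independent sketch. The Python below is not the book's program; the function names are made up for illustration, and it assumes at least three data points with increasing x-values. It sets up the same tridiagonal system for the second derivatives S, solves it by elimination and back substitution, and forms the coefficients A, B, C with the same formulas as the listing.

```python
import bisect


def natural_spline_second_derivs(x, y):
    """Second derivatives S of a cubic spline with S[0] = S[-1] = 0."""
    n = len(x)
    h = [x[i + 1] - x[i] for i in range(n - 1)]
    # One row per interior point: sub-diagonal, diagonal, super-diagonal, rhs.
    sub = [h[i - 1] for i in range(1, n - 1)]
    diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n - 1)]
    sup = [h[i] for i in range(1, n - 1)]
    rhs = [6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
           for i in range(1, n - 1)]
    m = n - 2
    # Forward elimination of the sub-diagonal.
    for k in range(1, m):
        factor = sub[k] / diag[k - 1]
        diag[k] -= factor * sup[k - 1]
        rhs[k] -= factor * rhs[k - 1]
    # Back substitution.
    inner = [0.0] * m
    inner[m - 1] = rhs[m - 1] / diag[m - 1]
    for k in range(m - 2, -1, -1):
        inner[k] = (rhs[k] - sup[k] * inner[k + 1]) / diag[k]
    return [0.0] + inner + [0.0]


def spline_coefficients(x, y, s):
    """Coefficients of g_i(u) = A*dx**3 + B*dx**2 + C*dx + y_i, dx = u - x_i."""
    a, b, c = [], [], []
    for i in range(len(x) - 1):
        h = x[i + 1] - x[i]
        a.append((s[i + 1] - s[i]) / (6.0 * h))
        b.append(s[i] / 2.0)
        c.append((y[i + 1] - y[i]) / h - (2.0 * h * s[i] + h * s[i + 1]) / 6.0)
    return a, b, c


def spline_eval(u, x, y, a, b, c):
    """Evaluate the spline at u by Horner's rule; end intervals extrapolate."""
    i = bisect.bisect_right(x, u) - 1
    i = max(0, min(i, len(x) - 2))
    dx = u - x[i]
    return y[i] + dx * (c[i] + dx * (b[i] + dx * a[i]))


# Same test data as the output above: f(x) = x**3 - 8 at x = 0, 1, 2, 3, 4.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [t ** 3 - 8.0 for t in xs]
S = natural_spline_second_derivs(xs, ys)
A, B, C = spline_coefficients(xs, ys, S)
print([round(v, 4) for v in S])
print(round(spline_eval(0.25, xs, ys, A, B, C), 4))
```

For this data the second derivatives agree with the S(I) values listed in the output for IEND = 1, and the value at x = 0.25 agrees with the first interpolated entry of that table.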


The fifth and final program (Fig. 3.15) is a Pascal program that can generate a cubic
B-spline curve. A graphics package is needed to interface with this program. The curve
is actually a set of straight lines. The program allows up to 20 straight lines to make up
a segment of the curve determined by any four control points. That information is stored
in the variable lines_per_section. In PROCEDURE set_blending_functions, the values
of the four B-spline functions on the interval [0, 1] are evaluated for a fixed number of
points. The values are stored in an array called BLEND, which is 4 by N, where N ≤ 20.
After a segment determined by four control points ($p_i$, $p_{i+1}$, $p_{i+2}$, $p_{i+3}$) has been drawn,
$p_i$ is dropped (PROCEDURE next_section) and a new point $p_{i+4}$ is added (PROCEDURE
put_in_sm). The data points for this program were stored in a file called "bspline2.dat".

This Turbo Pascal program uses a graphics package based on the HP AGIOS (Alpha-
numeric/Graphics Input/Output Subsystem) functions. The output from this program is
the B-spline graphs used in this chapter.
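The blending step described above can be sketched without a graphics package. The Python below is an illustration only, not the book's Pascal program: the names `blending` and `bspline_segment` are hypothetical, and it returns the polyline vertices rather than drawing them. The four cubic B-spline functions are the same expressions that set_blending_functions stores in BLEND.

```python
def blending(u):
    """The four cubic B-spline blending functions at u in [0, 1]."""
    return (
        (1.0 - u) ** 3 / 6.0,
        u ** 3 / 2.0 - u ** 2 + 2.0 / 3.0,
        -u ** 3 / 2.0 + u ** 2 / 2.0 + u / 2.0 + 1.0 / 6.0,
        u ** 3 / 6.0,
    )


def bspline_segment(p0, p1, p2, p3, lines_per_section=10):
    """Vertices of the straight-line approximation to one curve segment.

    Each control point is an (x, y) pair; the segment is approximated by
    lines_per_section straight lines, so one more vertex than that is returned.
    """
    control = (p0, p1, p2, p3)
    vertices = []
    for k in range(lines_per_section + 1):
        u = k / lines_per_section
        weights = blending(u)
        x = sum(w * p[0] for w, p in zip(weights, control))
        y = sum(w * p[1] for w, p in zip(weights, control))
        vertices.append((x, y))
    return vertices


# Four sample control points (arbitrary screen coordinates).
for vertex in bspline_segment((100, 100), (50, 150), (200, 150), (50, 50), 4):
    print(vertex)
```

Because the blending functions do not depend on the control points, they can be tabulated once and reused for every segment, which is the design choice the Pascal program makes. The weights sum to 1 for every u, so each vertex lies in the convex hull of its four control points.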


PROGRAM cubic_B_spline (INPUT,OUTPUT,f);
(*
   This program implements the Algorithms in Chapter 11 of
   S. Harrington but the first and last segments of the curve
   are handled in a simpler manner. Finally, only the cubic
   B-spline functions are generated.

   It is assumed that one has graphics commands such as

      PROCEDURE drawline(x1,y1,x2,y2 : INTEGER);
      PROCEDURE plot (x,y : INTEGER );
      PROCEDURE graphON;
      PROCEDURE graphOFF;
      PROCEDURE clearGRAPHICS;

   These and others are usually available in a graphics package.
   The one used in this program is based on the Hewlett-Packard
   AGIOS (ALPHANUMERIC/GRAPHICS INPUT/OUTPUT SUBSYSTEM) functions
   for the HP 150.
*)


Figure 3.15 Program 5.


CONST
   max_number_of_lines = 20;    (* each curve segment consists
                                   of 20 straight lines *)
   max_number_of_points = 20;
TYPE
   matrix = ARRAY[1..4,1..max_number_of_lines] OF REAL;
   vector = ARRAY[1..max_number_of_points] OF REAL;
   vector4 = ARRAY[1..4] OF REAL;
VAR
   blend             : matrix;
   xsm,ysm           (* set of four points to create current
                        segment of B-spline curve *)
                     : vector4;
   ax,ay             (* data points as stored in the array *)
                     : vector;
   x0, y0            : REAL;
   lines_per_section, i,
   number_of_points  : INTEGER;
   f                 : TEXT;    (* data file that stores the number of
                                   points and the data points (x,y) in
                                   screen coordinates. *)


(*
   INITIALIZE THE BLENDING FUNCTIONS
   This procedure evaluates the B-spline basis functions at
   the number_of_lines points of u on the interval [0,1]. Since
   these functions are independent of the control points, they
   can be done just once and stored in a 4 x max_number_of_lines
   array.
*)
PROCEDURE set_blending_functions(number_of_lines : INTEGER);
VAR
   i, j : INTEGER;
   u, u_cube, u_square,
   uMINUS1_cube : REAL;
BEGIN
   FOR i := 1 TO number_of_lines DO BEGIN
      u := i/number_of_lines;
      u_square := u*u; u_cube := u*u_square;
      uMINUS1_cube := (1.0-u)*SQR(1.0-u);
      blend[1,i] := uMINUS1_cube/6.0;
      blend[4,i] := u_cube/6;
      blend[3,i] := (-u_cube/2 + u_square/2 + u/2 + 1.0/6);
      blend[2,i] := (u_cube/2 - u_square + 2/3);
   END;
END; (* INITIALIZE BLENDING FUNCTIONS *)


(*
   PUT_IN_SM

   This procedure places a new sample point into the _sm arrays.
*)
PROCEDURE put_in_sm(x,y : REAL);
BEGIN


(*
   MAKE_CURVE

   This procedure fills in a section of the curve.
*)
PROCEDURE make_curve(VAR b: matrix);


VAR
   i, j : INTEGER;
   x, y : REAL;
BEGIN
   FOR j := 1 TO lines_per_section DO BEGIN
      x := 0.0; y := 0.0;
      FOR i := 1 TO 4 DO BEGIN
         x := x + xsm[i]*b[i,j];
         y := y + ysm[i]*b[i,j] END;
      drawline(ROUND(x0), ROUND(y0), ROUND(x), ROUND(y));
      x0 := x; y0 := y END;
END; (* make_curve *)


(*
   NEXT_SECTION

   After a section of the curve has been drawn, we shift the sample points
   so that our blending functions can be applied to the next section.
*)
PROCEDURE next_section;
VAR
   i : INTEGER;
BEGIN
   FOR i := 1 TO 3 DO BEGIN
      xsm[i] := xsm[i+1];
      ysm[i] := ysm[i+1] END
END;
(*
   CURVE_ABS_2

   This procedure extends the curve by taking a new sample
   point as its argument and stores it into the _sm arrays.
*)
PROCEDURE curve_abs_2(x,y : REAL);
BEGIN
   put_in_sm(x,y);
   make_curve(blend);
   next_section
END; (* curve_abs_2 *)


BEGIN (* MAIN *)
   number_of_lines := 10;
   gotoXY(1,1); ClrScr; graphON; clearGRAPHICS;

   ASSIGN(f,'b:bspline2.dat');
   RESET(f);
   WHILE NOT EOF(f) DO BEGIN
      initialize;

      (* Create a box about the display *)
      drawLINE(0,0, 0,389);
      drawLINE(0,389, 511,389);
      drawLINE(511,389, 511,0);
      drawLINE(511,0, 0,0);

      set_blending_functions(lines_per_section);
      start_B_spline(ax,ay);

      FOR i := 3 TO number_of_points DO
         curve_abs_2(ax[i],ay[i]);
      end_B_spline;

      REPEAT UNTIL keyPRESSED;
      clearALPHA;
      clearGRAPHICS END; (* WHILE *)

   graphOFF;
   CLOSE(f)
END.


EXERCISES 


Section 3.2 


1. Write an interpolating polynomial that passes through each point:


5 2 1 


y +32. 1.6 ==1.5) 
Plot the points and sketch the parabola that passes through them.


»2. Given the four points (1, 0), (-2, 15), (-1, 0), (2, 9), write the Lagrangian form of the
cubic that passes through them. Multiply out each term to express in standard form, $ax^3 +
bx^2 + cx + d$.
3. Given that ln 2 = 0.69315 and ln 5 = 1.60944, compute the natural logarithms of each
integer from 1 to 10. Compare to a five-place table.

4. If $e^{0.2}$ is estimated by interpolation among the values of $e^{0} = 1$, $e^{0.1} = 1.1052$, and $e^{0.3} =
1.3499$, find the maximum and minimum estimates of the error. Compare with the actual
error.


Repeat Exercise 4, except use extrapolation to pet €*. 


Section 3.3 


6. Construct the divided-difference table for these points:

       x        f(x)

     -0.2      1.3940
      0.5      1.0025
      0.1      1.1221
      0.7      1.0085
      0.0      1.1884


7. Repeat Exercise 2, except use divided differences. Compare the resulting polynomial in
standard form with that obtained in Exercise 2.

8. Use the divided-difference table from Exercise 6 to estimate f(0.15), using:
a) polynomial of degree 2 through the first three points.

b) polynomial of degree 2 through the last three points.

c) polynomial of degree 3 through the first four points.

d) polynomial of degree 3 through the last four points.

e) polynomial of degree 4.

Why are the results different?

9. Repeat Exercise 4, but now use divided differences.

10. Repeat Exercise 5, but now use divided differences.


Section 3.4 


11. Complete a difference table for the following data:

    x       1.20     1.25     1.30     1.35     1.40     1.45     1.50

    f(x)    0.1823   0.2231   0.2624   0.3001   0.3365   0.3716   0.4055


12. In Exercise 11, what degree of polynomial is required to fit exactly to all seven data pairs?
What lesser-degree polynomial will nearly fit the data? Justify your answer.

13. Form a difference table for $f(x) = x^3 - 3x^2 + 2x + 1$ for x = -1(0.2)1. (Recall that nested
multiplication is more efficient, especially on a desk calculator.) Verify from your table the
validity of Eq. (3.5).


14. Use Newton-Gregory forward-interpolating polynomials of degree 3 to estimate f(0.158) and
f(0.636), given the following table. In the first polynomial, choose $x_0 = 0.125$. In the second,
let $x_0 = 0.375$.

      x        f(x)        Δf         Δ²f        Δ³f        Δ⁴f

    0.125    0.79168
                        -0.01834
    0.250    0.77334               -0.01129
                        -0.02963               0.00134
    0.375    0.74371               -0.00995               0.00038
                        -0.03958               0.00172
    0.500    0.70413               -0.00823               0.00028
                        -0.04781               0.00200
    0.625    0.65632               -0.00623
                        -0.05404
    0.750    0.60228


15. Add one term to the work of Exercise 14 to estimate f(0.158) by a fourth-degree polynomial.


16. Use the data below to find the value of y at x = 0.58, using a cubic polynomial that fits the
table at x-values of 0.3, 0.5, 0.7, and 0.9.

      x        y        Δy       Δ²y      Δ³y

     0.1     0.003
                       0.064
     0.3     0.067              0.017
                       0.081              0.002
     0.5     0.148              0.019
                       0.100              0.003
     0.7     0.248              0.022
                       0.122              0.004
     0.9     0.370              0.026
                       0.148              0.005
     1.1     0.518              0.031
                       0.179
     1.3     0.697


17. What is the minimum-degree polynomial that will exactly fit all seven data pairs of Exercise
16? (Answer is not sixth-degree.)


18. Using the x- and y-values given in the table of Exercise 16, construct a divided-difference
table up to third differences. How do these values compare to those for $\Delta y$, $\Delta^2 y$, $\Delta^3 y$?


Section 3.5 


19. Using the data of Exercise 14, write a Newton-Gregory forward polynomial that fits the table
at x-values of 0.500, 0.625, and 0.750. Then write the Newton-Gregory backward polynomial
that fits the same three points. Demonstrate that these are two different forms of the same
polynomial by rewriting both in the form $a_1x^2 + a_2x + a_3$.

20. Repeat Exercise 14 using Newton-Gregory backward polynomials, but choose $x_0 = 0.500$
in the first and $x_0 = 0.750$ in the second polynomial. Are the same results obtained?



»21. Using the data of Exercise 14, estimate f(0.385) by the following. Use quadratic polynomials
in each case. (a) Newton-Gregory forward, $x_0 = 0.250$; (b) Newton-Gregory backward,
$x_0 = 0.500$; (c) Gauss backward, $x_0 = 0.375$; (d) Stirling, $x_0 = 0.375$. Why are all these
results identical?

22. For the data of Exercise 16, write cubic interpolation polynomials that terminate on the third
difference whose value is 0.004. Use the formulas to write polynomials of the following
forms: (a) Newton-Gregory forward; (b) Newton-Gregory backward; (c) Gauss forward; (d)
Gauss backward; (e) Bessel.

»23. Use each of the polynomials of Exercise 22 to interpolate at x = 0.92. For each polynomial,
compare in a table the values of $y_0$, s, and y(0.92) that are involved.

Section 3.6 

24. Write the error terms of each polynomial of Exercise 22. Evaluate the coefficients (the
s-polynomials).

»25. What is the error term for linear interpolation? Show that the coefficient is of maximum
magnitude at the midpoint between $x_0$ and $x_1$.

26. A table of tan x with the argument given in radians has $\Delta x = 0.01$. Near $x = \pi/2$, the
tangent function increases very rapidly and hence linear interpolation is not very accurate.
What degree of interpolating polynomial is needed to interpolate at x = 1.506 to three-decimal
accuracy?

27. Estimate the errors for each of the results in Exercise 8. Does this explain the differences in
the estimates? The function in Exercise 6 is actually f(x) = 1/sin(1 + x), and f(0.15) =
1.0956. Do the actual errors agree with the estimates of errors?

28. In Section 3.6, you were asked to explain why the maximum error between $f(x) =
1/(1 + 25x^2)$ and a polynomial that matches f(x) at equispaced points increases when the
degree of the polynomial is increased. Why does this occur?

Section 3.7 

29. An operator $\alpha$ is called a linear operator if $\alpha(f + g) = \alpha f + \alpha g$ and $\alpha(cf) = c\,\alpha f$, where
c = constant. Show that $E$, $\Delta$, and $\nabla$ are linear operators based on their definitions by
Eq. (3.11).

30. Two operators, $\alpha$ and $\beta$, are said to commute if the result of operating on a function is not
changed by changing the order of the operations, that is, $\alpha(\beta f) = \beta(\alpha f)$. Show that $E$, $\Delta$,
and $\nabla$ commute.

31. Define D as the differentiation operator. Show that D commutes with $E$, $\Delta$, and $\nabla$.

»32. Show that
a) $\Delta[f(x)g(x)] = f(x)\,\Delta g(x) + g(x + h)\,\Delta f(x)$.
b) $\Delta^n x^n = \nabla^n x^n = \nabla^n E^m x^n = n!$ when h = 1.
33. Express A'E~?V*y, in terms of y, entries of the table 
»34. If $\Delta^n y_s = \nabla^n y_r$, express r in terms of s.
35. $\delta = E^{1/2} - E^{-1/2}$ defines the central-difference operator. Show that
$\delta^n y_s = \Delta^n E^{-n/2} y_s = \nabla^n E^{n/2} y_s$.

Section 3.8
36. In Section 3.8, $y = P_n(x)$ is given in terms of divided differences. Using this, find the x-value
that corresponds to $P_n(x) = 3.0$ by some method from Chapter 1. Then determine the x-value
by the method of successive approximations described in Section 3.8.

37. If the x-values are equispaced, inverse interpolation is easier to do using a Gauss forward
polynomial rather than with a divided-difference polynomial because the differences remain
centered on the point of inverse interpolation without having to adjust the range of the polynomial.
Use successive approximations based on a Gauss forward polynomial that fits the data of
Exercise 11 to find the x-value that corresponds to f(x) = 0.2852. (Actually, the data are for
f(x) = ln x. Compare your answer to $e^{0.2852}$.)

»38. For the function $y = x^2$, obviously the points (1, 1), (2, 4), (3, 9) are on its graph. Considering
x as a function of y, we have $x = \sqrt{y}$, and for y = 25, x = 5. But one may also compute
x corresponding to y = 25 by inverse interpolation from the first three points. Do this by
Lagrangian interpolation (extrapolating, of course). What error estimates are available?

39. Repeat Exercise 38, but determine x at y = 0 by inverse interpolation. Are you surprised
that the interpolated x-value is again 0.6? Sketch the curves for $x = \sqrt{y}$ and for the interpolating
parabola.
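
The value 0.6 quoted in Exercise 39 can be checked with a few lines of code. The following Python sketch is an illustration only; the function name `lagrange` is hypothetical. Inverse interpolation simply swaps the roles of the two columns, so the y-values are passed as abscissas and the x-values as ordinates.

```python
def lagrange(abscissas, ordinates, t):
    """Value at t of the Lagrangian polynomial through the given points."""
    total = 0.0
    for i in range(len(abscissas)):
        term = ordinates[i]
        for j in range(len(abscissas)):
            if j != i:
                term *= (t - abscissas[j]) / (abscissas[i] - abscissas[j])
        total += term
    return total


# Points (1, 1), (2, 4), (3, 9) on y = x**2, with x treated as a function of y.
y_values = [1.0, 4.0, 9.0]
x_values = [1.0, 2.0, 3.0]
print(lagrange(y_values, x_values, 25.0))
print(lagrange(y_values, x_values, 0.0))
```

Both calls give 0.6 up to rounding, far from the true values 5 and 0, which shows how poorly the inverse parabola extrapolates outside the data.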


Section 3.9 


40. Consider the function $f(x) = 20/(1 + 5x^2)$ over the interval [-2, 2]. Compute and graph
polynomials of degree 1, 2, 3, 4, and 5 that agree with f(x) at uniformly spaced points on
the interval.

41. Confirm the statement that, for a set of data exactly fitted by a cubic, the values of S at the
two ends will be linearly related to the adjacent S-values if the spline curve and the cubic
polynomial are the same function. If one changes one end value for S, say $S_0$, does this
change the portions of the spline curve in intervals other than the first?

42. Find the coefficient matrix and the right-hand-side vector for fitting a cubic spline to the
following data. Use linearity condition on the terminal S-values:

      x       f(x)        x       f(x)

     0.15    0.3945      1.07    0.2251
     0.76    0.2989      1.73    0.0893
     0.89    0.2685      2.11    0.0431


43. Solve the set of equations of Exercise 42 (you may wish to utilize a computer program for
this), and then determine the constants of the various cubics. The data are the ordinates of
the normal probability function. Compare a few interpolated values with tabulated values of
the function, say at x = 0.30, 0.80, 1.50, 2.00.

44. Fit a natural cubic spline to $f(x) = 20/(1 + 5x^2)$ on the interval [-2, 2]. Use five equispaced
points on the function [x = -2(1)2]. Graph the spline and compare to the graphs of Exercise
40.

45. Repeat Exercise 44, but use end conditions 2 and 3. Compare results to Exercise 44. Repeat
again with $f'(x_0) = 0.9$ and $f'(x_n) = -0.9$.

46. A cubic spline fitted to a full period of periodic data will have the first and second derivatives
identical at the two endpoints. Develop the equations that give the S-values for this case. Is
the matrix tridiagonal?

47. The data for Example 2 in Section 3.9 are actually for a periodic phenomenon, but the
solutions given ignore this fact. Use end conditions appropriate for a periodic function and
compare the spline curve so obtained with the results given in Section 3.9.


Section 3.10 


48. Show that the matrix forms of the equations for Bezier and B-spline curves are equivalent to
the algebraic equations given in Section 3.10.

49. Write the matrix forms of equations for Bezier and B-spline curves of order 4.

50. Prove that the convex hull does enclose all the points for both the Bezier and B-spline curves.
Hint: use the fact that any point p in the convex hull formed by the points $\{p_0, p_1, \ldots, p_n\}$
can be written as $p = \sum_{i=0}^{n} a_i p_i$, where $a_i \ge 0$ and $\sum_{i=0}^{n} a_i = 1$.

51. The slope at the ends of the cubic B-spline seems to be the same as that between the adjacent
points. Is this true? Does this also hold for B-splines of higher order?

52. A succession of points has been used to construct a set of connected B-spline curves. One
of the points then is changed and the curves are recomputed. What part of the combination
curve is affected? Is the same kind of effect noticed if the connected curves are Bezier curves?
What if the set of points were fitted with a cubic spline? Do the terms local control and
global control apply to the phenomenon you observe?

53. Going to higher degrees of B-splines is a natural extension. What about reducing the degree
to give a quadratic B-spline? What assumptions are reasonable in reducing the degree?

54. Compute and then graph the cubic Bezier curves that connect this set of points:

    Point      x      y

      1       100    100
      2        50    150
      3       200    150
      4        50     50
      5       100    200
      6        50    100
      7       200    100
      8        50     50
      9       100    200
     10       100    100


55. Repeat Exercise 54 for cubic B-splines.


Section 3.11 


56. In Section 3.11, the assertion is made that the order of interpolation makes no difference.
Demonstrate that this is true by interpolating within the data of Table 3.10 to find the values
at y = 0.33 (within rows with x constant at 1.0, 1.5, and 2.0), using cubic interpolation
formulas that fit the table at y = 0.2, 0.3, 0.4, and 0.5. Then interpolate within these three
values to determine f(1.6, 0.33), and compare to the value 1.841 obtained in the text.

»57. After the example of Section 3.11 was finished, it was observed that a cubic in x and a
quadratic in y would have been preferable. Do this to obtain f(1.6, 0.33) and compare to the
true value, 1.8350. Use the best "region of fit."

»58. The example of Section 3.11 used a rectangular region of fit when a more nearly circular
region would appear to have advantages. Interpolate from the data in Table 3.10 to evaluate
f(1.62, 0.31) by a set of polynomials that fits the table at x = 1.5 and 2.0 when y = 0.2
and when y = 0.4, and at x = 0.5(0.5)2.5 when y = 0.3. Do this by forming a series of
difference tables. In this instance it is very awkward to interpolate first with x held constant,
but there is no problem if we begin with y constant.

59. Interpolate for f(3.55, 0.53) from the following data, using cubics in both directions.


1.100 0,756 


8.182 6.429 5.625 4.091 
Ke 12.445 9.779 8.556 6.223 
5.2 24.582 19.314 16.900 291 
38.409 26.406 22.237 19,205 


60. Find the value at x = 3.7, y = 0.6 on the B-spline surface patch that is generated from the
16 points in the upper left corner of the data in Exercise 59.


APPLIED PROBLEMS 


61. S. H. P. Chen and S. C. Saxena report experimental data for the emittance of tungsten as a
function of temperature (Ind. Eng. Chem. Fund. 12, 220 (1973)). Their data are given below.
They found that the equation

$$e(T) = 0.02424\left(\frac{T}{303.16}\right)^{1.27591}$$

correlated the data for all temperatures accurately to three digits. What degree of interpolating
polynomial is required to match to their correlation at points midway between the tabulated
temperatures? Discuss the pros and cons of polynomial interpolation in comparison to using
their correlation.

    T, K    300     400     500     600     700     800     900     1000    1100

    e       0.024   0.035   0.046   0.058   0.067   0.083   0.097   0.111   0.125

    T, K    1200    1300    1400    1500    1600    1700    1800    1900    2000

    e       0.140   0.155   0.170   0.186   0.202   0.219   0.235   0.252   0.269

62. In studies of radiation-induced polymerization, a source of gamma rays was employed to 


give measured doses of radiation. However, the dosage varied with position in the apparatus, 
with these figures being recorded: 


    Position, in. from
    base point             0      0.5    1.0    1.5    2.0    3.0    3.5    4.0

    Dosage, 10° rads/hr    1.90   2.39   2.71   2.98   3.20   3.20   2.98   2.74

For some reason, the reading at 2.5 in. was not reported, but the value of radiation there is
needed. Fit interpolating polynomials of various degrees to the data to supply the missing
information. What do you think is the best estimate for the dosage level at 2.5 in.?


63. M. S. Selim and R. C. Seagraves studied the kinetics of elution of copper compounds from
ion-exchange resins. The normality of the leaching liquid was the most important factor in
determining the diffusivity. Their data were obtained at convenient values of normality; we
desire a table of D for integer values of normality (N = 0.0, 1.0, 2.0, 3.0, 4.0, 5.0). Use
their data to construct such a table.


      N        D × 10°, cm²/sec        N        D × 10°, cm²/sec

    0.0521          1.65             0.9863          3.12
    0.1028          2.10             1.9739          3.06
    0.2036          2.27             2.443           2.92
    0.4946          2.76             5.06            2.07


64. When the steady-state heat-flow equation is solved numerically, temperatures u(x, y) are
calculated at the nodes of a gridwork constructed in the domain of interest. (This is the
content of Chapter 7.) When a certain problem was solved, the values given in the following
table were obtained. This procedure does not give the temperatures at points other than the
nodes of the grid; if they are desired, one can interpolate to find them. Use the data to estimate
the values of the temperature at the points (0.7, 1.2), (1.6, 2.4), and (0.65, 0.82).


0 0.0 5.00 10.00 15.00 20.00 25.00 
5 5.00 7.51 10.05 12.70 15.67 20.00 
0 10.00 10.00 10.00 = 10.00 10.00 10.00 
> 15.00 12.51 9.95 7:32 4.33 0.0 

0 20.00 15.00 10.00 5.00 0.00 —5.00 


65. Star S in the Big Dipper (Ursa Major) has a regular variation in its apparent magnitude. Leon
Campbell and Luigi Jacchia give data for the mean light curve of this star in their book The
Story of Variable Stars (Blakiston, 1941). A portion of these data is given here.

    Phase        -110    -80     -40     10      30      80      110

    Magnitude    7.98    8.95    10.71   11.70   10.01   8.23    7.86

The data are periodic in that the magnitude for phase = -120 is the same as for phase =
+120. The spline functions discussed in Section 3.9 do not allow for periodic behavior.
For a periodic function, the slope and second derivatives are the same at the two endpoints.
Taking this into account, develop a spline that interpolates the above data.

Other data given by Campbell and Jacchia for the same star are

    Phase        -100    -60     -20     20      60      100

    Magnitude    8.37    9.40    11.39   10.84   8.53    7.89

How well do interpolants based on your spline function agree with this second set of
observations?

66. Develop the matrices to make Eq. (3.29) generate points on a Bezier surface. Show that this
passes through the 12 points on the borders of the group of 16 in Figure 3.10. How could a
Bezier surface be created that passes through the innermost set of four points?

67. A fictitious chemical experiment produces seven data points:

    t    -1      -0.96    -0.86    -0.79    0.22     0.5    0.930

    y    -1      -0.151   0.894    0.986    0.895    0.5    -0.306

a) Plot the points and interpolate a smooth curve by intuition.
b) Plot the unique sixth-degree polynomial that interpolates these points.
c) Use a spline program to evaluate enough points to plot this curve.
d) Compare your results with the graph in Fig. 3.16.

Figure 3.16 (Heavy line is a cubic spline interpolation; light line is the Lagrangian interpolation.)


Numerical Differentiation and
Numerical Integration


4.0 CONTENTS OF THIS CHAPTER


Chapter 4 tells you how to estimate the derivative or get the integral of a function 
by numerical methods. Numerical procedures differ from the analytical methods 
of the calculus in that they can be applied to functions known only as a table 
of values as well as to functions stated explicitly. Since computers cannot readily 
be programmed to do the symbolic manipulations of formal integration tech- 
niques, nearly all computer programs for integration use these procedures. The 
material in this chapter has obvious applications in solving differential equations, 
which are the subject of the next chapter. 


4.1 DERIVATIVES, INTEGRALS FROM TABULAR VALUES 
Continues the fictional conversation between Rita Laski and Ed Baker to illustrate 
how numerical integration and differentiation might be required 

4.2 FIRST DERIVATIVES FROM INTERPOLATING POLYNOMIALS 
Derives several formulas by differentiating an interpolating polynomial; expres- 
sions for the errors are also developed 

4.3 FORMULAS FOR HIGHER DERIVATIVES 
Are developed similarly; in addition, the use of symbolic methods is explained 

4.4 LOZENGE DIAGRAMS FOR DERIVATIVES 
Are interesting and useful devices for recording the coefficients of formulas for 
differentiation; they also facilitate the development of alternative forms of formulas 
for differentiation 


4.5 EXTRAPOLATION TECHNIQUES
Allow you to get an improved estimate of the value of a derivative; they are
important in computer programs

4.6 ROUND-OFF AND ACCURACY OF DERIVATIVES
Discusses how this type of error interacts with the truncation error of the formulas;
the various formulas for finding derivatives are collected here and still another
method for developing formulas is presented

4.7 NEWTON-COTES INTEGRATION FORMULAS
Are derived by evaluating the definite integral of an interpolating polynomial, and
their error terms are similarly derived

4.8 THE TRAPEZOIDAL RULE
Is another name for the simplest of the Newton-Cotes formulas and is widely
used in computer programs for integration

4.9 ROMBERG INTEGRATION
Improves the accuracy of the value for the integral by the neat trick of comparing
two trapezoidal rule estimates with different Δx values

4.10 SIMPSON'S RULES
Are two other widely used Newton-Cotes integration formulas of greater accuracy
than the trapezoidal rule. They too can be extrapolated to get estimates with reduced
errors

4.11 OTHER WAYS TO DERIVE INTEGRATION FORMULAS
Include symbolic methods and the method of undetermined coefficients

4.12 GAUSSIAN QUADRATURE
Is a procedure for getting the integral of a known function that, by using particular
values for the independent variable, requires fewer function evaluations for a given
accuracy and hence is more efficient. You are introduced to the orthogonal Legendre
polynomials in this section

4.13 IMPROPER INTEGRALS AND INDEFINITE INTEGRALS
Can also be handled by an extension of the methods

4.14 ADAPTIVE INTEGRATION
Can reduce the number of function evaluations by a proper selection of the value
for Δx within subintervals of the interval of integration

4.15 APPLICATIONS OF CUBIC SPLINE FUNCTIONS
Can often give more accurate values for derivatives and integrals than the use of
ordinary interpolating polynomials

4.16 MULTIPLE INTEGRALS
Can also be evaluated numerically by extending the methods for single integrals

4.17 ERRORS IN MULTIPLE INTEGRATION AND EXTRAPOLATIONS
Shows that the error terms for multiple integration are of the same form as for
single integrals, and hence a Romberg procedure can be applied

4.18 MULTIPLE INTEGRATION WITH VARIABLE LIMITS
Shows how the methods can be adapted to this type of problem

4.19 CHAPTER SUMMARY
Lets you evaluate your understanding of the material 

4.20 COMPUTER PROGRAMS 
Provides several examples of FORTRAN programs for differentiation and 
integration 


4.1 DERIVATIVES, INTEGRALS FROM TABULAR VALUES


Rita and Ed were again at lunch in the cafeteria of Ruscon Engineering. 

“How did you make out on the interpolation problem?” Ed asked. 

“It went all right,” Rita replied. “In fact, we have computer programs now to inter- 
polate either with polynomials or with cubic splines. My boss now wants me to work on 
some extensions.” 

“What do you mean, extensions?” Ed asked. 

“Well, we need the time derivative of the position. That's the velocity, of course. 
And some calculations that lead to the fuel consumption require us to integrate a function 
when we know the values only at discrete times.” 

“I should think that would follow directly from what you did before,” Ed said. “If 
you have a polynomial that gives you the position as a function of time, can’t you just 
differentiate or integrate that polynomial?” 

“Very true,” Rita replied. “The catch is that we never really develop the polynomial; 
we just work with the position values and their differences. My boss warns me also that 
estimating the errors of derivatives and integrals determined from function values that 
are known only at discrete points in time is pretty tricky. I suspect that there are ways 
to tackle these problems that are similar to the methods for interpolation.” 

Rita is right; there are methods for getting derivatives and integrals from a table of 
function values, and these resemble methods for interpolation. In this chapter we will 
explore these methods. 


4.2 FIRST DERIVATIVES FROM INTERPOLATING POLYNOMIALS


If a function is reasonably well approximated by an interpolating polynomial, we should 
expect that the slope of the function should also be approximated by the slope of the 
polynomial, although we will discover that the error of estimating the slope is greater 
than the error of estimating the function. 

We begin this section with the divided-difference polynomial for approximating a
function:

$$
\begin{aligned}
f(x) &= P_n(x) + \text{error} \\
     &= f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) \\
     &\quad + \cdots + f[x_0, x_1, \ldots, x_n](x - x_0)(x - x_1)\cdots(x - x_{n-1}) + \text{error}.
\end{aligned}
\tag{4.1}
$$

As we saw from Section 3.2, the error term of Eq. (4.1) is

$$
\text{Error of } P_n(x) = (x - x_0)\cdots(x - x_n)\,\frac{f^{(n+1)}(\xi)}{(n+1)!}, \qquad x_0 < \xi < x_n. \tag{4.2}
$$

Differentiating Eq. (4.1), we get

$$
P_n'(x) = f[x_0, x_1] + f[x_0, x_1, x_2](2x - x_0 - x_1) + \cdots
 + f[x_0, x_1, \ldots, x_n] \sum_{i=0}^{n-1} \frac{(x - x_0)(x - x_1)\cdots(x - x_{n-1})}{x - x_i}. \tag{4.3}
$$

Note that

$$
\frac{d}{dx}\left[(x - x_0)(x - x_1)\cdots(x - x_{n-1})\right] = \sum_{i=0}^{n-1} \frac{(x - x_0)(x - x_1)\cdots(x - x_{n-1})}{x - x_i}.
$$

These equations are simplified if we evaluate them at $x = x_0$. Eq. (4.3) becomes

$$
f'(x_0) \approx f[x_0, x_1] + f[x_0, x_1, x_2](x_0 - x_1) + \cdots
 + f[x_0, x_1, \ldots, x_n](x_0 - x_1)(x_0 - x_2)\cdots(x_0 - x_{n-1}). \tag{4.4}
$$

Differentiating Eq. (4.2) gives us the error of Eq. (4.4):

$$
\text{Error of } P_n'(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} \sum_{i=0}^{n} \frac{(x - x_0)\cdots(x - x_n)}{x - x_i}
 + \frac{(x - x_0)\cdots(x - x_n)}{(n+1)!}\,\frac{d}{dx}\left[f^{(n+1)}(\xi)\right], \tag{4.5}
$$

where $\xi$ depends on x. However, if we set $x = x_0$ in Eq. (4.5), then we get

$$
\text{Error of } P_n'(x_0) = (x_0 - x_1)\cdots(x_0 - x_n)\,\frac{f^{(n+1)}(\xi)}{(n+1)!}.
$$
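The evaluation in Eq. (4.4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `divided_differences` and `derivative_at_x0`, the unevenly spaced points, and the test function $x^3$ are made up for the sketch and are not taken from the text.

```python
def divided_differences(xs, fs):
    """Return [f[x0], f[x0,x1], ..., f[x0,...,xn]], the top diagonal of the table."""
    coef = list(fs)
    n = len(xs)
    for level in range(1, n):
        # Work bottom-up so lower-order differences are still available.
        for i in range(n - 1, level - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - level])
    return coef


def derivative_at_x0(xs, fs):
    """Eq. (4.4): sum of f[x0..xk] * (x0 - x1)...(x0 - x_{k-1}) for k = 1..n."""
    coef = divided_differences(xs, fs)
    total = 0.0
    product = 1.0                      # empty product for the k = 1 term
    for k in range(1, len(xs)):
        total += coef[k] * product
        product *= xs[0] - xs[k]       # append the factor (x0 - xk) for the next term
    return total


xs = [1.0, 1.5, 1.7, 2.4]              # unevenly spaced points (assumed for illustration)
fs = [x ** 3 for x in xs]              # f(x) = x^3, so f'(1.0) = 3
slope = derivative_at_x0(xs, fs)
print(slope)
```

Because a cubic is reproduced exactly by a third-degree interpolating polynomial, the computed slope should agree with 3 up to round-off; for other functions the error term above applies.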


Since most of the development of the algorithms presented in Chapters 4 and 5
assumes equispaced points, we will make use of the Newton-Gregory forward polynomials
and leave the remaining derivations from divided-difference polynomials as exercises.

We begin with a Newton-Gregory forward polynomial:

$$
f(x_s) = P_n(x_s) + \text{error} = f_0 + s\,\Delta f_0 + \binom{s}{2}\Delta^2 f_0 + \cdots + \binom{s}{n}\Delta^n f_0 + \text{error}. \tag{4.6}
$$
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The error term of Eq. (4.6) is, by our rule for changing the next term, 


$$
\text{Error of } P_n(x_s) = \binom{s}{n+1} h^{n+1} f^{(n+1)}(\xi), \qquad x_0 < \xi < x_n. \tag{4.7}
$$


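Differentiating Eq. (4.6), remembering that $f_0$ and all the $\Delta$-terms are constants (after
all, they are just the numbers from the difference table), we have

$$
\begin{aligned}
f'(x_s) \approx P_n'(x_s) &= \frac{d}{dx}\left[P_n(x_s)\right] = \frac{d}{ds}\left[P_n(x_s)\right]\frac{ds}{dx} = \frac{1}{h}\,\frac{d}{ds}\left[P_n(x_s)\right] \\
&= \frac{1}{h}\Big(\Delta f_0 + \tfrac{1}{2}(s - 1 + s)\,\Delta^2 f_0 \\
&\qquad + \tfrac{1}{6}\left[(s - 1)(s - 2) + s(s - 2) + s(s - 1)\right]\Delta^3 f_0 + \cdots\Big).
\end{aligned}
\tag{4.8}
$$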


The derivatives of the factorials of s rapidly become algebraically complicated.* We
get considerable simplification, however, if we let s = 0, giving us the derivative
corresponding to $x_0$. If we let s = 0, Eq. (4.8) becomes

$$
f'(x_0) = \frac{1}{h}\left(\Delta f_0 - \frac{1}{2}\Delta^2 f_0 + \frac{1}{3}\Delta^3 f_0 - \frac{1}{4}\Delta^4 f_0 + \cdots \pm \frac{1}{n}\Delta^n f_0\right). \tag{4.9}
$$
In Eq. (4.9) the derivative value, approximating the derivative of the function, is
the derivative of an nth-degree polynomial passing through the point $(x_0, f_0)$ and n
additional points to its right, evaluated at $x = x_0$. The error in Eq. (4.9) is obtained by
differentiating the error term given in Eq. (4.7):

$$
\text{Error of } P_n'(x_s) = h^{n+1} f^{(n+1)}(\xi)\,\frac{1}{h}\,\frac{d}{ds}\binom{s}{n+1} + \binom{s}{n+1} h^{n+1}\,\frac{d}{dx}\left[f^{(n+1)}(\xi)\right]. \tag{4.10}
$$


The second term here cannot be evaluated, for the way that $\xi$ varies with x is unknown
but, when we let s = 0, this second term vanishes because

$$
\binom{s}{n+1} = \frac{s(s - 1)\cdots(s - n)}{(n+1)!} = 0
$$

at s = 0. Then we need only evaluate the first term of Eq. (4.10). Since

$$
\frac{d}{ds}\binom{s}{n+1} = \frac{(s - 1)(s - 2)\cdots(s - n) + s(s - 2)\cdots(s - n) + \cdots + s(s - 1)\cdots(s - n + 1)}{(n+1)!},
$$

at s = 0, only the first term of the numerator remains, and Eq. (4.10) becomes

$$
\text{Error of } P_n'(x_0) = h^{n+1} f^{(n+1)}(\xi)\,\frac{1}{h}\,\frac{(-1)^n\,n!}{(n+1)!} = \frac{(-1)^n}{n+1}\,h^n f^{(n+1)}(\xi). \tag{4.11}
$$

*In computing the derivatives, we have used the method: [formula illegible in the source]

Note that even though the interpolating polynomial gives the function exactly at
s = 0, our derivative formula is subject to an error of $O(h^n)$ at that point, unless
$f^{(n+1)}(\xi) = 0$. In comparing Eqs. (4.11) and (4.9) we see that the error term of
the derivative also can be obtained by changing the $\Delta^{n+1} f_0$ in the next term beyond the last
one included into $h^{n+1} f^{(n+1)}(\xi)$, exactly as with our interpolating formulas.

EXAMPLE


Use the data in Table 4.1 to estimate the derivative of y at x = 1.7. Use h = 0.2, and
compute using one, two, three, or four terms of the formula.

With one term:    $y'(1.7) = \frac{1}{0.2}(1.212) = 6.060.$
With two terms:   $y'(1.7) = \frac{1}{0.2}\left(1.212 - \frac{1}{2}(0.268)\right) = 5.390.$
With three terms: $y'(1.7) = \frac{1}{0.2}\left(1.212 - \frac{1}{2}(0.268) + \frac{1}{3}(0.060)\right) = 5.490.$
With four terms:  $y'(1.7) = \frac{1}{0.2}\left(1.212 - \frac{1}{2}(0.268) + \frac{1}{3}(0.060) - \frac{1}{4}(0.012)\right) = 5.475.$

Table 4.1

  x       y        Δy       Δ²y      Δ³y      Δ⁴y
 1.3    3.669
                  0.813
 1.5    4.482              0.179
                  0.992              0.041
 1.7    5.474              0.220              0.007
                  1.212              0.048
 1.9    6.686              0.268              0.012
                  1.480              0.060
 2.1    8.166              0.328              0.012
                  1.808              0.072
 2.3    9.974              0.400
                  2.208
 2.5   12.182
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The data tabulated in Table 4.1 are for y = e*, rounded to three decimals.* Since 
the derivative of $e^x$ is also $e^x$, we see that the error in the derivative is quite small with
four terms, comparing 5.475 to 5.474. We can anticipate this; since the fourth differences
in Table 4.1 do not vary greatly, the function is represented reasonably well by a
fourth-degree polynomial.
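The four estimates in this example can be reproduced with a short Python sketch of Eq. (4.9). The function names are hypothetical, and the last tabulated value (12.182 at x = 2.5) is an assumption inferred from the final first difference, 2.208, in Table 4.1.

```python
def forward_differences(values):
    """Return the columns [f, Δf, Δ²f, ...] of a forward-difference table."""
    columns = [list(values)]
    while len(columns[-1]) > 1:
        prev = columns[-1]
        columns.append([prev[i + 1] - prev[i] for i in range(len(prev) - 1)])
    return columns


def forward_derivative(values, i, h, terms):
    """Eq. (4.9): (1/h) * sum over k = 1..terms of (-1)^(k+1) * Δ^k f_i / k."""
    cols = forward_differences(values)
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k + 1) * cols[k][i] / k
    return total / h


# y = e^x rounded to three decimals at x = 1.3, 1.5, ..., 2.5 (Table 4.1;
# the value at x = 2.5 is inferred from the last first difference).
ys = [3.669, 4.482, 5.474, 6.686, 8.166, 9.974, 12.182]
estimates = [forward_derivative(ys, 2, 0.2, n) for n in (1, 2, 3, 4)]
for n, value in zip((1, 2, 3, 4), estimates):
    print(f"{n} term(s): y'(1.7) = {value:.3f}")
```

Index 2 selects x = 1.7 as $x_0$, so the differences used are those on the downward diagonal from 5.474.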

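Since we know that $f(x) = e^x$, the estimated errors for the preceding computations
can be calculated by Eq. (4.11):

With one term,

$$
\text{error} = \frac{(-1)^1}{2}(0.2)\,e^{\xi},\ \ 1.7 \le \xi \le 1.9;
\qquad -\frac{0.2}{2}\begin{Bmatrix} e^{1.7}\ (\text{min}) \\ e^{1.9}\ (\text{max}) \end{Bmatrix}
 = \begin{Bmatrix} -0.547\ (\text{min}) \\ -0.669\ (\text{max}) \end{Bmatrix}.
$$

(Actual error is -0.586.)

With two terms,

$$
\text{error} = \frac{(-1)^2}{3}(0.2)^2 e^{\xi},\ \ 1.7 \le \xi \le 2.1;
\qquad \frac{0.04}{3}\begin{Bmatrix} e^{1.7}\ (\text{min}) \\ e^{2.1}\ (\text{max}) \end{Bmatrix}
 = \begin{Bmatrix} 0.073\ (\text{min}) \\ 0.109\ (\text{max}) \end{Bmatrix}.
$$

(Actual error is 0.084.)

With three terms,

$$
\text{error} = \frac{(-1)^3}{4}(0.2)^3 e^{\xi},\ \ 1.7 \le \xi \le 2.3;
\qquad -\frac{0.008}{4}\begin{Bmatrix} e^{1.7}\ (\text{min}) \\ e^{2.3}\ (\text{max}) \end{Bmatrix}
 = \begin{Bmatrix} -0.011\ (\text{min}) \\ -0.020\ (\text{max}) \end{Bmatrix}.
$$

(Actual error is -0.016.)

With four terms,

$$
\text{error} = \frac{(-1)^4}{5}(0.2)^4 e^{\xi},\ \ 1.7 \le \xi \le 2.5;
\qquad \frac{0.0016}{5}\begin{Bmatrix} e^{1.7}\ (\text{min}) \\ e^{2.5}\ (\text{max}) \end{Bmatrix}
 = \begin{Bmatrix} 0.002\ (\text{min}) \\ 0.004\ (\text{max}) \end{Bmatrix}.
$$

(Actual error is -0.001.)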


In the case of four terms, the calculated and actual errors do not agree because of
round-off errors; Eq. (4.11) accounts only for the truncation error. In the other cases, the
round-off is not large enough relative to the truncation error to invalidate the estimates
from Eq. (4.11).
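The error bounds quoted above follow directly from Eq. (4.11) with $f^{(n+1)}(\xi) = e^{\xi}$. A minimal Python sketch, using a hypothetical helper name and assuming $\xi$ ranges over the span of the points used, is:

```python
import math


def truncation_error_bounds(n, h, x_lo, x_hi):
    """Eq. (4.11) for f = exp: ((-1)^n / (n+1)) * h^n * e^xi at both ends of [x_lo, x_hi]."""
    factor = (-1) ** n / (n + 1) * h ** n
    return factor * math.exp(x_lo), factor * math.exp(x_hi)


# n terms of Eq. (4.9) use the points x0 = 1.7 through 1.7 + n*h, with h = 0.2.
bounds = {n: truncation_error_bounds(n, 0.2, 1.7, 1.7 + 0.2 * n) for n in (1, 2, 3, 4)}
for n, (smallest, largest) in bounds.items():
    print(f"{n} term(s): {smallest:.3f} (min) to {largest:.3f} (max)")
```

These bounds measure truncation error only; the round-off present in the three-decimal table is not included.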

The formulas for derivatives given by Eq. (4.9) are called forward-difference
approximations because they involve only differences of function values forward from $f(x_0)$.
For a forward-difference approximation, the interpolating polynomial that is used does
not fit points that are symmetrical about $x_0$. We have seen in the previous chapter that
interpolation is more accurate near the center of the range of fit. A similar effect is
observed with derivatives. Consider the first two terms of Eq. (4.8):

$$
f'(x_s) = \frac{1}{h}\left[\Delta f_0 + \frac{1}{2}(s - 1 + s)\,\Delta^2 f_0\right] + \text{error}. \tag{4.12}
$$


*The table incidentally illustrates the effect of rounding errors in the fourth differences. A more accurate table
would show fourth differences of 0.0088, 0.0108, 0.0132.
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The second-degree interpolating polynomial to which Eq. (4.12) corresponds fits at 
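$x_0$, $x_1$, and $x_2$. Let s = 1 in Eq. (4.12) so that we can estimate $f'(x_1)$:

$$
f'(x_1) = \frac{1}{h}\left(\Delta f_0 + \frac{1}{2}\Delta^2 f_0\right) + \text{error}.
$$

Write out the differences in terms of f's:

$$
f'(x_1) = \frac{1}{h}\left(f_1 - f_0 + \frac{1}{2}(f_2 - 2f_1 + f_0)\right) + \text{error} = \frac{1}{h}\,\frac{f_2 - f_0}{2} + \text{error}. \tag{4.13}
$$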

Equation (4.13) can be called a central-difference approximation. The value of x 
where it is applied is centered in the range of fit of the polynomial—differences of 
function values on either side of f(x,) are used. 

The error term for Eq. (4.13) can be derived similarly to Eq. (4.11). It is

$$
\text{Error in } P_2'(x_1) = -\frac{1}{6}h^2 f'''(\xi), \qquad x_0 < \xi < x_2. \tag{4.14}
$$


Note particularly that the power of h in Eq. (4.14) is two, so the error is $O(h^2)$. Contrast
this to the first power of h in the error term when only one term of Eq. (4.8) is used,
although both computations involve only two functional values. The coefficient in the
error term is also significantly smaller. Central-difference formulas are decidedly superior
in calculating values for derivatives.

If we apply Eq. (4.13) to the data of Table 4.1 to estimate $f'(1.7)$, we get

$$
f'(1.7) = \frac{6.686 - 4.482}{2(0.2)} = 5.510.
$$

$$
\text{Estimate of error is } \begin{Bmatrix} -0.030\ (\text{min}) \\ -0.045\ (\text{max}) \end{Bmatrix}.
$$

(Actual error is -0.036.)
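As a check on this example, the central-difference estimate of Eq. (4.13) and the error bounds of Eq. (4.14) can be computed as in the sketch below. The function name is hypothetical, and the bounds assume $f'''(\xi) = e^{\xi}$ with $\xi$ between 1.5 and 1.9.

```python
import math


def central_first_derivative(values, i, h):
    """Eq. (4.13): (f_{i+1} - f_{i-1}) / (2h)."""
    return (values[i + 1] - values[i - 1]) / (2 * h)


ys = [3.669, 4.482, 5.474, 6.686, 8.166]   # y at x = 1.3, 1.5, 1.7, 1.9, 2.1 (Table 4.1)
h = 0.2
estimate = central_first_derivative(ys, 2, h)      # index 2 is x = 1.7

# Eq. (4.14): error = -(h^2 / 6) f'''(xi), evaluated at both ends of [1.5, 1.9].
error_min = -(h ** 2) / 6 * math.exp(1.5)
error_max = -(h ** 2) / 6 * math.exp(1.9)
print(f"estimate = {estimate:.3f}, error between {error_min:.3f} and {error_max:.3f}")
```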


Central-difference formulas similar to Eq. (4.13) can be derived using higher-degree
polynomials of even order. (The odd-degree polynomials do not have a range of fit that
is symmetrical about any of the x-values.) For example, the formula corresponding to a
fourth-degree polynomial, expressed in terms of function values rather than differences,
and related to the point $x_0$, is

$$
f'(x_0) = \frac{f_{-2} - 8f_{-1} + 8f_1 - f_2}{12h}, \qquad \text{Error} = \frac{1}{30}h^4 f^{(5)}(\xi), \quad x_{-2} \le \xi \le x_2.
$$

In many applications it is preferable to use the simpler formula of Eq. (4.13) and
control the error by making h small. Observe, however, that the error of the derivative
from an nth-degree polynomial is $O(h^n)$, while the error of interpolation is $O(h^{n+1})$.
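To see the two central-difference formulas side by side, the following Python sketch applies the three-point formula of Eq. (4.13) and the five-point formula above to the Table 4.1 data at x = 1.7. The function names are hypothetical; the comparison value 5.474 is the tabulated $e^{1.7}$.

```python
def three_point(values, i, h):
    """Eq. (4.13): central difference with O(h^2) error."""
    return (values[i + 1] - values[i - 1]) / (2 * h)


def five_point(values, i, h):
    """Fourth-degree central formula: (f_{-2} - 8 f_{-1} + 8 f_1 - f_2) / (12 h)."""
    return (values[i - 2] - 8 * values[i - 1] + 8 * values[i + 1] - values[i + 2]) / (12 * h)


ys = [3.669, 4.482, 5.474, 6.686, 8.166]   # y at x = 1.3 ... 2.1 (Table 4.1)
h = 0.2
d3 = three_point(ys, 2, h)
d5 = five_point(ys, 2, h)
print(f"three-point: {d3:.4f}   five-point: {d5:.4f}   tabulated e^1.7: 5.474")
```

With the three-decimal data the five-point result is already limited by round-off in the table, so further increases in degree would not be expected to help much.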


4.3 FORMULAS FOR HIGHER DERIVATIVES


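We can get formulas for higher derivatives by further differentiating Eq. (4.8), and
differentiating Eq. (4.10) will give the error terms. For example,

$$
\begin{aligned}
f''(x_s) \approx P_n''(x_s) &= \frac{1}{h^2}\,\frac{d^2}{ds^2}\left[P_n(x_s)\right] \\
&= \frac{1}{h^2}\Big(\tfrac{1}{2}(2)\,\Delta^2 f_0 + \tfrac{1}{6}\left[(s - 2) + (s - 1) + (s - 2) + s + (s - 1) + s\right]\Delta^3 f_0 + \cdots\Big).
\end{aligned}
\tag{4.15}
$$

At s = 0,

$$
f''(x_0) = \frac{1}{h^2}\left(\Delta^2 f_0 - \Delta^3 f_0 + \cdots\right).
$$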
There is no simple pattern for the coefficients, and the error terms have complicated 
coefficients as well. The situation gets more complex with still higher derivatives, so we 


would like a simpler method. We have seen before that symbolic methods have given 
interpolation formulas with relative ease. It is true for derivative formulas also: 


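$$
E = 1 + \Delta, \qquad y_s = E^s y_0,
$$

$$
y_s' = \frac{d}{dx}\left(E^s y_0\right) = \frac{1}{h}\,\frac{d}{ds}\left(E^s y_0\right) = \frac{1}{h}(\ln E)\,E^s y_0. \tag{4.16}
$$

At s = 0,

$$
\begin{aligned}
y_0' &= \frac{1}{h}(\ln E)\,y_0 = \frac{1}{h}\ln(1 + \Delta)\,y_0 \\
&= \frac{1}{h}\left(\Delta y_0 - \frac{1}{2}\Delta^2 y_0 + \frac{1}{3}\Delta^3 y_0 - \frac{1}{4}\Delta^4 y_0 + \cdots\right).
\end{aligned}
\tag{4.17}
$$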


In writing Eq. (4.17), we have used the Maclaurin expansion for $\ln(1 + \Delta)$. Note that
Eq. (4.17) is identical to Eq. (4.9), which we derived before by nonsymbolic means.
If we use the symbol D for the derivative operator,

$$
Dy_0 = \frac{1}{h}\ln(1 + \Delta)\,y_0. \tag{4.18}
$$
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We abstract the equivalence between operators from Eq. (4.18): 
$$
D = \frac{1}{h}\ln(1 + \Delta). \tag{4.19}
$$


The value of the symbolic method is that algebraic operations on operator relations 
give valid results. Let us raise each side of Eq. (4.19) to powers: 


$$
D^2 = \frac{1}{h^2}\ln^2(1 + \Delta), \qquad
D^3 = \frac{1}{h^3}\ln^3(1 + \Delta), \qquad
D^n = \frac{1}{h^n}\ln^n(1 + \Delta). \tag{4.20}
$$


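Equations (4.20) show that we can get formulas for higher derivatives by multiplying
the series in Eq. (4.17) by itself. The second derivative is given by:

$$
\begin{aligned}
D^2 y_0 &= \frac{1}{h^2}\left(\Delta - \frac{1}{2}\Delta^2 + \frac{1}{3}\Delta^3 - \frac{1}{4}\Delta^4 + \cdots\right)^2 y_0 \\
&= \frac{1}{h^2}\left(\Delta^2 - \Delta^3 + \frac{11}{12}\Delta^4 - \frac{5}{6}\Delta^5 + \cdots\right) y_0 \\
&= \frac{1}{h^2}\left(\Delta^2 y_0 - \Delta^3 y_0 + \frac{11}{12}\Delta^4 y_0 - \frac{5}{6}\Delta^5 y_0 + \cdots\right).
\end{aligned}
\tag{4.21}
$$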


Formulas for derivatives of higher order can be similarly obtained by multiplication 
of the series in Eq. (4.17). 
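The series multiplication is easy to mechanize. The sketch below, with hypothetical function names, multiplies the truncated series for $\ln(1 + \Delta)$ by itself using exact fractions and recovers the coefficients of Eq. (4.21); one further multiplication gives the leading coefficients for the third derivative.

```python
from fractions import Fraction


def log_series(n_terms):
    """Coefficients of ln(1 + Δ) = Δ - Δ²/2 + Δ³/3 - ...; entry k multiplies Δ^k."""
    return [Fraction(0)] + [Fraction((-1) ** (k + 1), k) for k in range(1, n_terms + 1)]


def multiply_truncated(a, b, max_power):
    """Multiply two power series in Δ, discarding powers above max_power."""
    out = [Fraction(0)] * (max_power + 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j <= max_power:
                out[i + j] += ai * bj
    return out


series = log_series(5)
squared = multiply_truncated(series, series, 5)    # h^2 D^2, through Δ^5
cubed = multiply_truncated(squared, series, 5)     # h^3 D^3, through Δ^5
print([str(c) for c in squared[2:]])               # coefficients of Δ², Δ³, Δ⁴, Δ⁵
print([str(c) for c in cubed[3:]])                 # coefficients of Δ³, Δ⁴, Δ⁵
```

Exact rational arithmetic is used here so that the coefficients can be compared with the printed formula without any rounding.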

Equations (4.20) lead to an important generalization for the estimation of derivatives
from forward-difference formulas. The first term of all the formulas in terms of differences
has only $\Delta^n$. Hence, as a first approximation,*

$$
D^n y_0 = \frac{1}{h^n}\Delta^n y_0 + \text{error } O(h). \tag{4.22}
$$


EXAMPLE Use formula (4.21) to estimate $y''(1.7)$ from Table 4.1 using terms through $\Delta^3$. Also
estimate the error.

$$
y''(1.7) = \frac{1}{(0.2)^2}(0.268 - 0.060) = 5.200 \quad \text{(compare to 5.474, exact answer)}.
$$

$$
\text{Error} = \frac{1}{h^2}\left(\frac{11}{12}h^4 f^{(4)}(\xi)\right), \qquad 1.7 < \xi < 2.1,
$$


*This was used in the previous chapter in developing the "next-term rule" for the error of an interpolation
formula.

Max value of error = $\frac{11}{12}(0.2)^2 (e^{2.1}) = 0.299$,
Min value of error = $\frac{11}{12}(0.2)^2 (e^{1.7}) = 0.201$.

(Compare to the actual error of 0.274.)

In the absence of knowledge of the function, we could have used the fourth differences
to estimate $h^4 f^{(4)}$, as shown by Eq. (4.22). If the differences vary greatly, this is hazardous,
however. In this instance we would estimate the error as about 0.23 if we used 0.010
(the average value of $\Delta^4 y$) as an estimate of $h^4 f^{(4)}$.

Central differences for higher derivatives will give more accuracy than the forward-difference
formula represented by Eq. (4.21). We prefer to develop these through the
lozenge diagrams of the next section.


4.4 LOZENGE DIAGRAMS FOR DERIVATIVES


Since there are a variety of equivalent forms of the interpolating polynomial, we may
obtain a variety of forms of derivative formulas by differentiating the various polynomials.
They are interrelated, of course, as we have seen for interpolation formulas. It is
convenient to exhibit this interrelation by a so-called lozenge diagram, as in Fig. 4.1.
Examination of the lozenge diagram in Fig. 4.1 shows it to be a difference table in
symbolic form (similar to Table 3.3) with numerical values interspersed.
These values are used as coefficients of the differences of f to form the desired
formula and are selected by tracing a path through the difference table.
Certain rules must be followed:

1. One begins at the column of f's at $f_0$. (The point of beginning determines the
subscripting, and hence the values of s; one must be consistent with the subscripts
of the f's.)

2. One proceeds from left to right, either diagonally upward or downward to the
next difference column, or alternatively horizontally. A term is added for every
column crossed.

3. The term that we add is the entry in the difference table that is reached, multiplied by
the entry above if the last step was diagonally downward, by the entry below
if the last step was diagonally upward, or by the average of entries above and
below if it was horizontal. The coefficient of the f-term is always zero, as the
interspersed figures indicate.

Figure 4.1 Lozenge diagram for f'(x). (Note: All formulas must be multiplied by 1/h.)

Figure 4.2 (the array of coefficients of Fig. 4.1, with columns labeled A, B, C, ...)

There is a special property of lozenge diagrams that makes them easy to construct.
The diagram can be considered to be of two parts, the differences of the function and
the interspersed coefficients, one superposed on the other. Consider the array of
coefficients of Fig. 4.1, shown here as Fig. 4.2.

Compare the columns in pairs, beginning with columns A and B. Observe that the
values in A are the differences of the values in B. Now compare columns B and C: column
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B is just the differences of the values in column C, with each value subtracted from the 
one above it. The same thing is true for columns C and D, columns D and E, and so
on. In other words, the array of coefficients is a difference table, but written from right
to left and upside down. This means that one line of coefficients, which we get from Eq.
(4.9), suffices to create the entire array of coefficients! In Fig. 4.2, this set of coefficients
is shaded.

Equation (4.21) serves to start a lozenge diagram for the second derivative at x = 
Xo in a similar fashion. The result is Fig. 4.3. 

We use the lozenge diagrams to write formulas for differentiation of a tabulated
function by following a path through the difference table. Any path through the table
may be chosen; the coefficients are then given by the rules discussed above. Suppose we
use a horizontal path beginning at $f_0$:

$$
\begin{aligned}
\left.\frac{df}{dx}\right|_{x = x_0} &= \frac{1}{h}\left[\frac{\Delta f_{-1} + \Delta f_0}{2} + 0\cdot\Delta^2 f_{-1} - \frac{1}{6}\,\frac{\Delta^3 f_{-2} + \Delta^3 f_{-1}}{2} + \cdots\right] \\
&= \frac{1}{h}\left[\frac{\Delta f_{-1} + \Delta f_0}{2} + O(h^3)\right] = \frac{f_1 - f_{-1}}{2h} + O(h^2).
\end{aligned}
\tag{4.23}
$$

Figure 4.3 Lozenge diagram for f"(x). (Note: All formulas must be multiplied by 1/h².)
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Using Fig. 4.3 to find the second derivative, again following a horizontal path, we obtain 


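$$
\begin{aligned}
\left.\frac{d^2 f}{dx^2}\right|_{x = x_0} &= \frac{1}{h^2}\left[\Delta^2 f_{-1} + 0\cdot\frac{\Delta^3 f_{-2} + \Delta^3 f_{-1}}{2} - \frac{1}{12}\Delta^4 f_{-2} + \cdots\right] \\
&= \frac{1}{h^2}\left[(f_1 - 2f_0 + f_{-1}) + O(h^4)\right] \\
&= \frac{f_1 - 2f_0 + f_{-1}}{h^2} + O(h^2).
\end{aligned}
\tag{4.24}
$$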


Observe that following a horizontal path leads to a central-difference approximation. 
Following a diagonally downward path gives a formula based on a Newton-Gregory 
forward polynomial. 

Equations (4.23) and (4.24) are formulas of special importance, Eq. (4.23) being 
the same as Eq. (4.13). Note that even though they are only one-term formulas, they 
have errors of O(h?), in contrast to the O(h) error for Eqs. (4.8) and (4.21), when only 
one term is used. As previously stated, these O(h?) formulas are known as central- 
difference approximations. Their good error terms occur because the point at x = x9 is 
at the center of the range of x-values at which the interpolating polynomial matches the 
table. Central-difference formulas of higher order can be developed by continuing further 
on our horizontal path. 

The efficiency of the central-difference approximations for second derivatives, as 
compared to the forward-difference approximation, is illustrated by the following example. 

Using the data of Table 4.1, we compute $f''(1.7)$.

Using forward differences, Eq. (4.21),

$$
f''(1.7) = \frac{0.268}{(0.2)^2} = 6.700, \qquad \text{Error} = -1.226.
$$

Using central differences, Eq. (4.24),

$$
f''(1.7) = \frac{0.220}{(0.2)^2} = 5.500, \qquad \text{Error} = -0.026.
$$


The central-difference expression is much more accurate. The estimated errors for
the above computations are -1.095 (min) to -1.633 (max) for forward differences, and
-0.015 (min) to -0.022 (max) for central differences. (Note that, in the second case,
round-off causes the actual error to fall outside these limits.)
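The comparison between the one-term forward and central formulas for the second derivative can be reproduced with the following sketch. The function names are hypothetical; the data are the Table 4.1 values at x = 1.5 through 2.1.

```python
def second_forward(values, i, h):
    """First term of Eq. (4.21): Δ²f_i / h² = (f_{i+2} - 2 f_{i+1} + f_i) / h²."""
    return (values[i + 2] - 2 * values[i + 1] + values[i]) / h ** 2


def second_central(values, i, h):
    """Eq. (4.24): (f_{i+1} - 2 f_i + f_{i-1}) / h²."""
    return (values[i + 1] - 2 * values[i] + values[i - 1]) / h ** 2


ys = [4.482, 5.474, 6.686, 8.166]      # y at x = 1.5, 1.7, 1.9, 2.1 (Table 4.1)
h = 0.2
forward = second_forward(ys, 1, h)     # index 1 is x = 1.7
central = second_central(ys, 1, h)
exact = 5.474                          # e^1.7 as tabulated
print(f"forward: {forward:.3f} (error {exact - forward:+.3f})")
print(f"central: {central:.3f} (error {exact - central:+.3f})")
```

Both formulas use three function values; only the placement of $x_0$ within the range of fit differs.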

Figure 4.3 gives a formula of $O(h^4)$ error for the second derivative if we extend the
horizontal path until we reach the column of fourth differences:

$$
f_0'' = \frac{1}{h^2}\left[\Delta^2 f_{-1} - \frac{1}{12}\Delta^4 f_{-2}\right] + O(h^4).
$$

Rewriting this in terms of the f-values, we get an equivalent expression:

$$
f_0'' = \frac{-f_2 + 16f_1 - 30f_0 + 16f_{-1} - f_{-2}}{12h^2} + O(h^4).
$$

Applying these formulas to the data of Table 4.1, we compute $f''(1.7)$ as

$$
f''(1.7) = \frac{1}{(0.2)^2}\left[0.220 - \frac{1}{12}(0.007)\right] = 5.485
$$

$$
= \frac{-8.166 + 16(6.686) - 30(5.474) + 16(4.482) - 3.669}{12(0.2)^2} = 5.485.
$$

(Compare the result to the exact value, 5.474. The error, -0.011, is much larger than
the truncation error (-0.0049 to -0.00208) due to round-off.)
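The equivalence of the difference form and the function-value form of the $O(h^4)$ second-derivative formula can be checked numerically. In this sketch the function names are hypothetical, and the differences are formed directly from the Table 4.1 values rather than read from the table.

```python
def second_derivative_from_differences(values, i, h):
    """(Δ²f_{-1} - Δ⁴f_{-2} / 12) / h², with x0 at index i."""
    d2 = values[i + 1] - 2 * values[i] + values[i - 1]
    d4 = (values[i + 2] - 4 * values[i + 1] + 6 * values[i]
          - 4 * values[i - 1] + values[i - 2])
    return (d2 - d4 / 12) / h ** 2


def second_derivative_five_point(values, i, h):
    """(-f_2 + 16 f_1 - 30 f_0 + 16 f_{-1} - f_{-2}) / (12 h²), with x0 at index i."""
    return (-values[i + 2] + 16 * values[i + 1] - 30 * values[i]
            + 16 * values[i - 1] - values[i - 2]) / (12 * h ** 2)


ys = [3.669, 4.482, 5.474, 6.686, 8.166]   # y at x = 1.3 ... 2.1 (Table 4.1)
a = second_derivative_from_differences(ys, 2, 0.2)
b = second_derivative_five_point(ys, 2, 0.2)
print(f"{a:.3f}  {b:.3f}")
```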

The central-difference approximations extend directly to higher derivatives as well.
Lozenge diagrams can be constructed for these, but if we need only the first nonzero
term, a simple formula will suffice. For the even-order derivatives it is

$$
f_0^{(n)} = \frac{\Delta^n f_{-n/2}}{h^n} + O(h^2), \qquad n \text{ even}.
$$

The difference required is always found in the table on the horizontal line through $x_0$ and
$f_0$. If the difference table has not been constructed, we would express it in terms of the
f-values.

For odd-order derivatives, the expression involves the average of differences found
just above and just below the horizontal line through $x_0$:

$$
f_0^{(n)} = \frac{\Delta^n f_{-(n+1)/2} + \Delta^n f_{-(n-1)/2}}{2h^n} + O(h^2), \qquad n \text{ odd}. \tag{4.25}
$$


4.5 EXTRAPOLATION TECHNIQUES


There is an alternative way of successively improving the accuracy of our estimates of
the derivative that is equivalent to adding terms as given by the lozenge diagram. This
is the technique of successive extrapolations; it has special utility in computer programs
for differentiation of arbitrary functions.

We illustrate the method first for a function known only from a table, such as Table
4.2. We desire the derivative $f'(x)$ at x = 2.5. The central-difference formula, Eq. (4.13)
or Eq. (4.23), gives

$$
f'(x)\big|_{x = 2.5} = \frac{0.25337 - 0.31729}{2(0.1)} = -0.3196.
$$

This estimate has an error $O(h^2)$.


Table 4.2

  x      f(x)         x      f(x)
 2.0    0.42298      2.6    0.25337
 2.1    0.40051      2.7    0.22008
 2.2    0.37507      2.8    0.18649
 2.3    0.34718      2.9    0.15290
 2.4    0.31729      3.0    0.11963
 2.5    0.28587

We can also compute the derivative by using values at x = 2.7 and x = 2.3, for
which the spacing is 0.2, and get

$$
f'(x)\big|_{x = 2.5} = \frac{0.22008 - 0.34718}{2(0.2)} = -0.3178.
$$

The error here is also $O(h^2)$, but now h is twice as large as in the previous calculation.

In order to combine these two estimates and extrapolate to a more accurate estimate,
we examine the nature of the error term of Eq. (4.23) in greater detail. Using the lozenge
diagram, Fig. 4.1, we have

$$
f_0' = \frac{1}{h}\left(\frac{\Delta f_{-1} + \Delta f_0}{2} - \frac{1}{6}\,\frac{\Delta^3 f_{-2} + \Delta^3 f_{-1}}{2} + \frac{1}{30}\,\frac{\Delta^5 f_{-3} + \Delta^5 f_{-2}}{2} - \cdots\right). \tag{4.26}
$$

If we truncate after the first term, we get an error of $O(h^2)$, but Eq. (4.25) shows that

$$
f_0''' = \frac{\Delta^3 f_{-2} + \Delta^3 f_{-1}}{2h^3} + O(h^2).
$$

Equivalently, we may write

$$
-\frac{1}{6}\,\frac{1}{h}\,\frac{\Delta^3 f_{-2} + \Delta^3 f_{-1}}{2} = -\frac{1}{6}f_0'''\,h^2 + O(h^4) = Ch^2 + O(h^4).
$$

In this expression C is a constant; we merely recognize that the third derivative of f
evaluated at $x_0$ has a fixed value. Equation (4.26) can hence be written as


$$
\begin{aligned}
f_0' &= \frac{1}{h}\,\frac{\Delta f_{-1} + \Delta f_0}{2} + \left[Ch^2 + O(h^4)\right] + \frac{1}{h}\,\frac{1}{30}\,\frac{\Delta^5 f_{-3} + \Delta^5 f_{-2}}{2} - \cdots \\
&= \frac{1}{h}\,\frac{\Delta f_{-1} + \Delta f_0}{2} + Ch^2 + O(h^4).
\end{aligned}
$$

We see that we make only an order $h^4$ error in assuming that the error of Eq. (4.23) is
proportional to $h^2$. We then combine the two estimates of the derivative, -0.3196 with
h = 0.1 and -0.3178 with h = 0.2, as follows:

$$
\begin{aligned}
f_0'(\text{exact}) &= -0.3178 + C(0.2)^2 + O(h^4), && \text{based on less accurate value,} \\
&\approx -0.3178 + C(0.2)^2, && \text{if we neglect the } O(h^4) \text{ term;} \\
f_0'(\text{exact}) &= -0.3196 + C(0.1)^2 + O(h^4), && \text{based on more accurate value,} \\
&\approx -0.3196 + C(0.1)^2, && \text{if we neglect the } O(h^4) \text{ term.}
\end{aligned}
$$

We solve for the unknown value of $f_0'(\text{exact})$ by eliminating the terms in C between
the two expressions,

$$
f_0'(\text{exact}) \approx -0.3196 + \frac{1}{3}\left[-0.3196 - (-0.3178)\right] = -0.3202.
$$

The method can be applied to any two estimates, each with $O(h^n)$ errors and with values
of h varying by 2 to 1. The rule is

$$
\text{Exact value} \approx \text{more accurate} + \frac{1}{2^n - 1}\,(\text{more accurate} - \text{less accurate}).
$$

We show that the relationship is not perfect by using the $\approx$ sign; this lack of exactness
is the result of our neglect of the $O(h^{n+2})$ terms in the two expressions.
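The extrapolation rule is a one-line computation. The sketch below applies it to the two central-difference estimates from Table 4.2; the function name `extrapolate` is hypothetical, and n = 2 reflects the $O(h^2)$ error of the central-difference formula.

```python
def extrapolate(more_accurate, less_accurate, n):
    """Combine two estimates with O(h^n) errors whose h values are in the ratio 1 : 2."""
    return more_accurate + (more_accurate - less_accurate) / (2 ** n - 1)


# Table 4.2 values around x = 2.5.
f_23, f_24, f_26, f_27 = 0.34718, 0.31729, 0.25337, 0.22008

estimate_h1 = (f_26 - f_24) / (2 * 0.1)    # h = 0.1, more accurate
estimate_h2 = (f_27 - f_23) / (2 * 0.2)    # h = 0.2, less accurate
improved = extrapolate(estimate_h1, estimate_h2, 2)
print(f"{estimate_h1:.5f}  {estimate_h2:.5f}  ->  {improved:.5f}")
```

The same function serves for later stages of extrapolation by passing n = 4, 6, and so on.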

This extrapolation method can be extended since the first-order extrapolations with
error $O(h^4)$ can be shown to have errors of the form $Ch^4 + O(h^6)$. (Observe that this
follows from the alternately zero coefficients on the horizontal path through Fig. 4.1, so
this principle extends to all orders of extrapolation. Errors due to rounding of the original
data eventually dominate, however, so the extrapolation technique must be abandoned
at some point.)

For the data in Table 4.2 we can extend to one more order of extrapolation. We
arrange the data in tabular form.

          Initial       First-order      Second-order
  h       estimate      extrapolation    extrapolation
 0.1     -0.31960
                        -0.32022
 0.2     -0.31775                        -0.32020
                        -0.32050
 0.4     -0.30951

The last entry was calculated as

$$
-0.32022 + \frac{1}{15}(-0.32022 + 0.32050) = -0.32020.
$$

An extra guard figure was carried to minimize round-off in the computations.
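The whole extrapolation table can be generated from Table 4.2 with a short loop. This is a sketch under the assumption that the initial estimates come from the central-difference formula with h = 0.1, 0.2, and 0.4; the variable names are illustrative only.

```python
# f(x) from Table 4.2 at x = 2.0, 2.1, ..., 3.0; index 5 is x = 2.5.
fs = [0.42298, 0.40051, 0.37507, 0.34718, 0.31729, 0.28587,
      0.25337, 0.22008, 0.18649, 0.15290, 0.11963]
spacing = 0.1
center = 5


def central(k):
    """Central-difference estimate of f'(2.5) using points k table steps away."""
    return (fs[center + k] - fs[center - k]) / (2 * k * spacing)


column = [central(1), central(2), central(4)]      # h = 0.1, 0.2, 0.4
tableau = [column]
order = 2                                          # initial estimates have O(h^2) error
while len(column) > 1:
    # Each entry pairs an estimate with the one at twice the spacing.
    column = [column[i] + (column[i] - column[i + 1]) / (2 ** order - 1)
              for i in range(len(column) - 1)]
    tableau.append(column)
    order += 2                                     # error order rises by two per stage

for stage in tableau:
    print("  ".join(f"{value:.5f}" for value in stage))
```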

Since the lozenge for second derivatives, Fig. 4.3, also shows alternate coefficients
to be zero, approximations for the second derivative can be extrapolated in exactly the
same way, the order of the error increasing to $h^4$, $h^6$, $h^8$, and so on, at each step.

It is left as an exercise for the student to show that this extrapolation method is
equivalent to passing a higher-degree interpolating polynomial through the data (with the
point $x = x_0$ at the center of the region of fit for each of these), and then finding the
derivative by central-difference formulas.

The application of this extrapolation method in a computer program is important
because successive estimates, with h decreasing, have smaller and smaller truncation
errors, and hence provide a built-in measure of accuracy. The method is generally
employed to differentiate a function of known form, rather than a tabulated function. The
basic scheme is to compute the derivative at two values of h, one being one-half the
value of the other, and to extrapolate as in the above example. If a significant change is
made by the extrapolation, compared to the more accurate of the original estimates, h is
halved again and a second-order extrapolation is made. Again if a significant improvement
results, another computation is made with h halved once more and a third-order
extrapolation is made. The process is continued until the improvement is less than some criterion.
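One way the scheme just described might be organized is sketched below. The function name, the starting spacing h = 0.5, the tolerance, and the row limit are all assumptions chosen for illustration; they are not values prescribed by the text.

```python
import math


def extrapolated_derivative(f, x, h=0.5, tol=1e-8, max_rows=10):
    """Halve h and extrapolate until the last correction is below tol.

    Returns (estimate, number_of_rows_computed).
    """
    def central(step):
        return (f(x + step) - f(x - step)) / (2 * step)

    previous_row = [central(h)]
    for row_number in range(1, max_rows):
        h /= 2
        row = [central(h)]
        for k in range(1, row_number + 1):
            factor = 4 ** k                # 2^(2k): error orders h^2, h^4, h^6, ...
            row.append(row[k - 1] + (row[k - 1] - previous_row[k - 1]) / (factor - 1))
        # The difference between the last two entries is the most recent improvement.
        if abs(row[-1] - row[-2]) < tol:
            return row[-1], row_number + 1
        previous_row = row
    return previous_row[-1], max_rows      # limit reached without meeting tol


value, rows_used = extrapolated_derivative(math.exp, 0.0)
print(value, rows_used)
```

The `max_rows` argument puts a limit on the number of repetitions, so that round-off cannot drive the process indefinitely.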

Since round-off affects calculation by computers just as it does hand computations,
eventually these errors will control the accuracy of successive values, so the student is
cautioned against indiscriminate use of extrapolation. It is generally good practice to put
a limit to the number of repetitions of the process. This is discussed in the next section.


4.6 ROUND-OFF AND ACCURACY OF DERIVATIVES


It is important to realize the effect of round-off error on the accuracy of derivatives
computed by finite-difference formulas. As we have seen, decreasing the value of h or
increasing the degree of the interpolating polynomial increases the accuracy of our
derivative formulas. (These procedures reduce the truncation error.)

As h is reduced, however, we are required to subtract function values that are more
nearly the same value, and this, we have seen, incurs a large error due to round-off. The
increase in round-off error as h gets smaller causes the best accuracy to occur at some
intermediate point. The data in Fig. 4.4(a) and (b) illustrate this phenomenon. The test
was made with both single-precision arithmetic (giving 6-7 significant figure accuracy)
and double-precision (14-15 figures). The function was $f(x) = e^x$, at x = 0. The true
value of $f'(0)$ is 1.0, of course. Central-difference approximations were employed.

In the single-precision test, the optimum value of h was only 0.01 for the first
derivative, and the best accuracy of the second derivative was at h = 0.1, a surprisingly
large value. When the round-off errors were reduced by using double precision, essentially
exact results were obtained for the first derivative throughout the range of h values. For
the second derivative, $h = 0.1 \times 10^{-2}$ or $h = 0.1 \times 10^{-3}$ gives the best accuracy. Note
how large the error grows; it is especially marked for second derivatives when h becomes
much smaller than the optimum. Higher derivatives than the second will be expected to
show an even more exaggerated influence of round-off.

   H            DX                DDX
0.1E 00     0.1001663E 01     0.1000761E 01
0.1E-01     0.1000001E 01     0.9959934E 00
0.1E-02     0.9999569E 00     0.8940693E 00
0.1E-03     0.9959930E 00    -0.8344640E 02
0.1E-04     0.9775158E 00    -0.4768363E 04
0.1E-05     0.9834763E 00    -0.5960462E 05
0.1E-06     0.5960463E 00    -0.1192093E 08
0.1E-07     0.2980230E 01    -0.5960458E 09
0.1E-08     0.2980229E 02    -0.5960459E 11
0.1E-09     0.2980229E 03    -0.5960456E 13

a) Results with single precision

   H            DX                DDX
0.1E 00     0.1001667E 01     0.1000834E 01
0.1E-01     0.1000016E 01     0.1000008E 01
0.1E-02     0.1000000E 01     0.1000000E 01
0.1E-03     0.1000000E 01     0.9999999E 00
0.1E-04     0.1000000E 01     0.9999989E 00
0.1E-05     0.1000000E 01     0.1000031E 01
0.1E-06     0.9999999E 00     0.9894834E 00
0.1E-07     0.9999999E 00    -0.4163322E 00
0.1E-08     0.9999999E 00    -0.4163319E 02
0.1E-09     0.1000000E 01     0.2775545E 04

b) Results with double precision

Figure 4.4 Derivatives computed by central-difference formulas; $f(x) = e^x$, at x = 0.
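An experiment of the same kind is easy to repeat. The sketch below uses Python's built-in floating-point type (IEEE double precision), so its numbers correspond roughly to part (b) of Fig. 4.4 but will not match the figure digit for digit, since the machine and arithmetic behind the figure are different. The function names are hypothetical.

```python
import math


def first_central(f, x, h):
    """Central-difference first derivative, Eq. (4.23)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def second_central(f, x, h):
    """Central-difference second derivative, Eq. (4.24)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)


rows = []
for exponent in range(1, 11):                      # h = 0.1, 0.01, ..., 1e-10
    h = 10.0 ** (-exponent)
    rows.append((h, first_central(math.exp, 0.0, h), second_central(math.exp, 0.0, h)))

print("     H          DX             DDX")
for h, dx, ddx in rows:
    print(f"{h:8.0e}  {dx:.10f}  {ddx:.10f}")
```

The second-derivative column should show the pattern discussed above: the error first falls as h is reduced and then grows sharply once the subtraction of nearly equal function values dominates.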

Different models of computers vary widely in the number of digits of accuracy they 
provide, due to the different number of bits provided for the storage of floating-point 
values and also, to a lesser degree, due to the format of their floating-point representations. 
For example, a CDC Cyber provides 60 bits for floating-point values, and gives the 
equivalent of about 14 decimal digits of accuracy. Another factor that affects the accuracy 
of functional values is the algorithm used in their evaluation. Chapter 10 gives some 
insight into this. Different software packages that are used with computer operating 
systems employ different algorithms, which are of varying accuracy. Further, as Chapter 
10 will show, the error of approximating the function differs with the value of the argument. 

No analytical way of predicting the effects of round-off error seems to exist. The 
best we can do is to warn budding numerical analysts so they will be properly wary when 
computing derivatives numerically. In the words of many authorities, the process of 
differentiating is basically an unstable process—meaning that small errors made during 
the process cause greatly magnified errors in the final result. 

Differentiation of “noisy” data encounters a similar problem. If the data being differentiated
are from experimental tests, or are observations subject to errors of measurement,
the errors so influence the derivative values calculated by numerical procedures
that they may be meaningless. The usual recommendation is to smooth the data first, 
using methods that are discussed in Chapters 3 and 10. Passing a cubic spline through 
the points and then getting the derivative of this approximation to the data has become 
quite popular. A least-squares curve may also be used. The strategy involved is straight- 
forward—we don’t try to represent the function by one that fits exactly to the data points, 
since this fits to the errors as well as to the trend of the information. Rather, we approximate 
with a smoother curve that we hope is closer to the truth than the data themselves. The 
problem, of course, is how much smoothing should be done. One can go too far and 
“smooth” beyond the point where only errors are eliminated. 

A final situation should be mentioned. Some functions, or data from a series of tests, 
are inherently “rough.” By this we mean that the function values change rapidly; a graph 
would show sharp local variations. When the derivative values of the function incur rapid 
changes, a sampling of the information may not reflect them. In this instance, the data 
indicate a smoother function than actually exists. Unless enough data are at hand to show 
the local variations, valid values of the derivatives just cannot be obtained. The only 
solution is more data, especially near the “rough” spots. And then we are beset by problems 
of accuracy of the data! 

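For convenience, we collect formulas for computing derivatives here. 


Formulas for Computing Derivatives 

Formulas for the first derivative: 

$$f'(x_0) = \frac{f_1 - f_0}{h} + O(h),$$

$$f'(x_0) = \frac{f_1 - f_{-1}}{2h} + O(h^2), \qquad \text{(Central difference)}$$

$$f'(x_0) = \frac{-f_2 + 4f_1 - 3f_0}{2h} + O(h^2),$$

$$f'(x_0) = \frac{-f_2 + 8f_1 - 8f_{-1} + f_{-2}}{12h} + O(h^4). \qquad \text{(Central difference)}$$

Formulas for the second derivative: 

$$f''(x_0) = \frac{f_2 - 2f_1 + f_0}{h^2} + O(h),$$

$$f''(x_0) = \frac{f_1 - 2f_0 + f_{-1}}{h^2} + O(h^2), \qquad \text{(Central difference)}$$

$$f''(x_0) = \frac{-f_3 + 4f_2 - 5f_1 + 2f_0}{h^2} + O(h^2),$$

$$f''(x_0) = \frac{-f_2 + 16f_1 - 30f_0 + 16f_{-1} - f_{-2}}{12h^2} + O(h^4). \qquad \text{(Central difference)}$$
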
Perhaps an easier and more intuitive approach to deriving the above formulas is 
through a Taylor series expansion about the point $x_0$. Suppose we have the points 
$(x_{-2}, f_{-2}), \ldots, (x_2, f_2)$, where $f_i = f(x_i)$ and the points are evenly spaced, that is, 
$x_i = x_0 + ih$, $i = -2, \ldots, 2$. If we consider the series expansion about $x_0$ for $x = x_0 \pm h$, 
we get 

$$f(x_1) = f_1 = f_0 + h f_0' + \frac{h^2}{2} f_0'' + \frac{h^3}{6} f_0''' + \frac{h^4}{24} f_0^{iv} + \cdots, \tag{4.28}$$

$$f(x_{-1}) = f_{-1} = f_0 - h f_0' + \frac{h^2}{2} f_0'' - \frac{h^3}{6} f_0''' + \frac{h^4}{24} f_0^{iv} - \cdots. \tag{4.29}$$

Subtracting these we find that 

$$f_0' = \frac{f_1 - f_{-1}}{2h} + O(h^2);$$

adding the same we get 

$$f_0'' = \frac{f_1 - 2f_0 + f_{-1}}{h^2} + O(h^2).$$
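If we take a Taylor series expansion for $x = x_0 \pm 2h$, we obtain 

$$f_2 = f_0 + 2h f_0' + 2h^2 f_0'' + \frac{4}{3} h^3 f_0''' + \frac{2}{3} h^4 f_0^{iv} + \cdots, \tag{4.30}$$

$$f_{-2} = f_0 - 2h f_0' + 2h^2 f_0'' - \frac{4}{3} h^3 f_0''' + \frac{2}{3} h^4 f_0^{iv} - \cdots. \tag{4.31}$$

If we combine these new ones with the previous, it is not hard to see that 

$$f_0' = \frac{-f_2 + 8f_1 - 8f_{-1} + f_{-2}}{12h} + O(h^4),$$

$$f_0'' = \frac{-f_2 + 16f_1 - 30f_0 + 16f_{-1} - f_{-2}}{12h^2} + O(h^4).$$
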
Using this same approach, we can generate our own formulas for special needs. For 
instance, suppose our data points are unevenly spaced as in the following table: 

    x       f(x)
    0.90    2.4596
    1.00    2.7183
    1.11    3.0344


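The expansion about $x_0$ would be 

$$f(x_0 + t) = f_0 + t f_0' + \frac{t^2}{2} f_0'' + \frac{t^3}{6} f_0''' + \cdots, \tag{4.32}$$

$$f(x_0 - s) = f_0 - s f_0' + \frac{s^2}{2} f_0'' - \frac{s^3}{6} f_0''' + \cdots. \tag{4.33}$$

From these we can derive the equation 

$$f_0' = \frac{\dfrac{1}{t^2} f(x_0 + t) - \dfrac{1}{s^2} f(x_0 - s) - \left(\dfrac{1}{t^2} - \dfrac{1}{s^2}\right) f_0}{\dfrac{1}{t} + \dfrac{1}{s}}. \tag{4.34}$$

Using the values in the table with t = 0.11 and s = 0.10, we get from Eq. (4.34) 

$$f_0' = \frac{\dfrac{1}{(0.11)^2}(3.0344) - \dfrac{1}{(0.10)^2}(2.4596) - \left(\dfrac{1}{(0.11)^2} - \dfrac{1}{(0.10)^2}\right)(2.7183)}{\dfrac{1}{0.11} + \dfrac{1}{0.10}} = 2.7235.$$

Since these values are based on $f(x) = e^x$, the exact answer is 2.7183 at x = 1.0. 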
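
The unevenly spaced formula is easy to evaluate mechanically. The sketch below is an illustration in Python rather than anything given in the text; the function name `derivative_uneven` and its argument order are assumptions made for the example, and the formula coded is the one obtained by eliminating the second-derivative term from the two expansions about $x_0$.

```python
def derivative_uneven(f_minus, f0, f_plus, s, t):
    """Estimate f'(x0) from f(x0 - s), f(x0) and f(x0 + t).

    The second-derivative terms of the two Taylor expansions cancel when
    the expansions are weighted by 1/t^2 and 1/s^2 and subtracted.
    """
    numerator = f_plus / t**2 - f_minus / s**2 - (1.0 / t**2 - 1.0 / s**2) * f0
    return numerator / (1.0 / t + 1.0 / s)

# Tabulated values of e^x at x = 0.90, 1.00, 1.11.
estimate = derivative_uneven(2.4596, 2.7183, 3.0344, s=0.10, t=0.11)
print(round(estimate, 4))
```

With equal spacing (s = t = h) the middle term vanishes and the expression reduces to the ordinary central-difference formula.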


NEWTON-COTES INTEGRATION FORMULAS 


The usual strategy in developing formulas for numerical integration is similar to that for 
numerical differentiation. We pass a polynomial through points of the function, and then 
integrate this polynomial approximation to the function. This permits us to integrate a 
function known only as a table of values. When the values are equispaced, our familiar 
Newton-Gregory forward polynomial is a convenient starting point, so 


$$\int_a^b f(x)\,dx \approx \int_a^b P_n(x_s)\,dx. \tag{4.35}$$

The formula we get from Eq. (4.35) will not be exact because the polynomial is not 
identical with f(x). We get an expression for the error by integrating the error term of 
$P_n(x_s)$: 

$$\text{Error} = \int_a^b \binom{s}{n+1} h^{n+1} f^{(n+1)}(\xi)\,dx.$$


There are various ways that we can employ Eq. (4.35). The interval of integration 
(a, b) can match the range of fit of the polynomial, $(x_0, x_n)$. In this case, we get the 
Newton-Cotes formulas; these are a set of integration rules corresponding to the varying 
degrees of the interpolating polynomial. The first three, with the degree of the polynomial 
1, 2, or 3, are particularly important, and we discuss them at length in the following 
sections. 

If the degree of the polynomial is too high, errors due to round-off and local irregularities 
can cause a problem. This explains why it is only the lower-degree Newton-Cotes 
formulas that are often used. 

The range of the polynomial and the interval of integration do not have to be the 
same. If the interval of integration extends outside the range of fit, however, we are 
extrapolating, and this incurs larger errors. If we only desire to get the integral of a 
function whose values are known, we will normally avoid extrapolation. As we will see 
in the next chapter, however, integrating the polynomial outside its range of fit leads to 
some important methods for solving differential equations. Using an interval of integration 
that is a subset of the points at which the polynomial agrees with the function also has 
special application in solving differential equations numerically. 

The utility of numerical integration extends beyond the need to integrate a function 
known only as a table of values. Most computer programs for integration of functions 
whose form is known use these numerical techniques rather than the analytical methods 
of the calculus. While it is possible to write a program to do symbol manipulation and 
hence to perform analytical integration for simple forms of elementary functions, these 
routines lack generality. Furthermore, in a large number of cases no closed form for the 
integral exists. Numerical integration applies regardless of the complexity of the integrand 
or the existence of a closed form for the integral. 

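Let us now develop our three important Newton-Cotes formulas. During the integration, 
we will need to change the variable of integration from x to s since our polynomials 
are expressed in terms of s. Observe that dx = h ds. 

For n = 1, 

$$\begin{aligned}
\int_{x_0}^{x_1} f(x)\,dx &\approx \int_{x_0}^{x_1} (f_0 + s\,\Delta f_0)\,dx = h\int_{s=0}^{s=1} (f_0 + s\,\Delta f_0)\,ds \\
&= h f_0\, s\,\Big]_0^1 + h\,\Delta f_0\, \frac{s^2}{2}\Big]_0^1 = h\left(f_0 + \tfrac{1}{2}\Delta f_0\right) \\
&= \frac{h}{2}\left[2f_0 + (f_1 - f_0)\right] = \frac{h}{2}(f_0 + f_1).
\end{aligned} \tag{4.36}$$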
$$\begin{aligned}
\text{Error} &= \int_{x_0}^{x_1} \frac{s(s-1)}{2} h^2 f''(\xi)\,dx = h^3 f''(\xi_1) \int_0^1 \frac{s^2 - s}{2}\,ds \\
&= h^3 f''(\xi_1)\left[\frac{s^3}{6} - \frac{s^2}{4}\right]_0^1 = -\frac{1}{12} h^3 f''(\xi_1), \qquad x_0 < \xi_1 < x_1.
\end{aligned} \tag{4.37}$$

In getting the error term of Eq. (4.37), we have used the mean-value theorem for integrals, 
which is that 

$$\int_a^b f(x)\,g(x)\,dx = f(\xi)\int_a^b g(x)\,dx, \qquad a \le \xi \le b,$$

provided that g(x) does not change sign in the interval (a, b). This applies to Eq. (4.37) 
because s(s - 1) is always less than or equal to zero on (0, 1). The value of $\xi_1$ is not 
necessarily the same as $\xi$, though both fall in the interval $(x_0, x_1)$. 


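For n = 2, 

$$\begin{aligned}
\int_{x_0}^{x_2} f(x)\,dx &\approx \int_{x_0}^{x_2} \left(f_0 + s\,\Delta f_0 + \frac{s(s-1)}{2}\Delta^2 f_0\right) dx \\
&= h\int_0^2 \left(f_0 + s\,\Delta f_0 + \frac{s(s-1)}{2}\Delta^2 f_0\right) ds \\
&= h f_0\, s\,\Big]_0^2 + h\,\Delta f_0\,\frac{s^2}{2}\Big]_0^2 + h\,\Delta^2 f_0\left(\frac{s^3}{6} - \frac{s^2}{4}\right)\Big]_0^2 \\
&= h\left(2f_0 + 2\Delta f_0 + \tfrac{1}{3}\Delta^2 f_0\right) = \frac{h}{3}(f_0 + 4f_1 + f_2).
\end{aligned} \tag{4.38}$$

When we integrate the next term to get the error, we find 

$$h\int_0^2 \frac{s(s-1)(s-2)}{6}\,\Delta^3 f_0\,ds = 0,$$

so, to get the error, we must integrate the succeeding term: 

$$\text{Error} = \int_{x_0}^{x_2} \frac{s(s-1)(s-2)(s-3)}{24}\, h^4 f^{iv}(\xi)\,dx = -\frac{1}{90} h^5 f^{iv}(\xi_1), \qquad x_0 < \xi_1 < x_2. \tag{4.39}$$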


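We omit the details of the integration in Eq. (4.39). 
Similarly, for n = 3, we find 

$$\int_{x_0}^{x_3} f(x)\,dx \approx \int_{x_0}^{x_3} P_3(x_s)\,dx = \frac{3h}{8}(f_0 + 3f_1 + 3f_2 + f_3). \tag{4.40}$$

$$\text{Error} = -\frac{3}{80} h^5 f^{iv}(\xi_1), \qquad x_0 < \xi_1 < x_3. \tag{4.41}$$

In summary, the basic Newton-Cotes formulas are 

$$\int_{x_0}^{x_1} f(x)\,dx = \frac{h}{2}(f_0 + f_1) - \frac{1}{12} h^3 f''(\xi),$$

$$\int_{x_0}^{x_2} f(x)\,dx = \frac{h}{3}(f_0 + 4f_1 + f_2) - \frac{1}{90} h^5 f^{iv}(\xi),$$

$$\int_{x_0}^{x_3} f(x)\,dx = \frac{3h}{8}(f_0 + 3f_1 + 3f_2 + f_3) - \frac{3}{80} h^5 f^{iv}(\xi).$$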


An important item to observe is that the error terms for both n = 2 and n = 3 are 
$O(h^5)$. This means that the error of integration using a quadratic is similar to the error 
using a cubic; it is a consequence of finding the integral of the next term equal to zero 
in deriving the error term of Eq. (4.39). Note also that the coefficient in Eq. (4.39), 
$-\frac{1}{90}$, is smaller in magnitude than that in Eq. (4.41), $-\frac{3}{80}$. The formula based on a 
quadratic is unexpectedly accurate. 

This phenomenon is true of all the even-order Newton-Cotes formulas; each has an 
order of h in its error term the same as for the formula of next higher order. This suggests 
that the even-order rules are especially useful. 
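
The claims about exactness and error coefficients can be checked numerically. The following is a small Python sketch with illustrative function names (they are not from the text): each basic rule is applied once, and the observed error for $f(x) = x^4$, whose fourth derivative is the constant 24, is compared with the error terms of Eqs. (4.39) and (4.41).

```python
def trapezoid_basic(f, x0, h):
    """One application of the n = 1 rule on [x0, x0 + h]."""
    return h / 2.0 * (f(x0) + f(x0 + h))

def simpson13_basic(f, x0, h):
    """One application of the n = 2 rule on [x0, x0 + 2h]."""
    return h / 3.0 * (f(x0) + 4.0 * f(x0 + h) + f(x0 + 2.0 * h))

def simpson38_basic(f, x0, h):
    """One application of the n = 3 rule on [x0, x0 + 3h]."""
    return 3.0 * h / 8.0 * (f(x0) + 3.0 * f(x0 + h)
                            + 3.0 * f(x0 + 2.0 * h) + f(x0 + 3.0 * h))

quartic = lambda x: x ** 4          # integral over [0, 1] is 1/5

# Both rules integrate over [0, 1]; error = true value - computed value.
h2, h3 = 0.5, 1.0 / 3.0
err_quadratic = 0.2 - simpson13_basic(quartic, 0.0, h2)
err_cubic = 0.2 - simpson38_basic(quartic, 0.0, h3)

predicted_quadratic = -(1.0 / 90.0) * h2 ** 5 * 24.0
predicted_cubic = -(3.0 / 80.0) * h3 ** 5 * 24.0

print(err_quadratic, predicted_quadratic)
print(err_cubic, predicted_cubic)
```

Because the fourth derivative is constant here, the observed and predicted errors agree to rounding; for a cubic integrand both rules are exact.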


THE TRAPEZOIDAL RULE 


The first of the Newton-Cotes formulas, based on approximating f(x) on $(x_0, x_1)$ by a 
straight line, is also called the trapezoidal rule. We have derived it by integrating $P_1(x_s)$, 
but the familiar and simple trapezoidal rule can also be considered to be an adaptation 
of the definition of the definite integral as a sum. To evaluate $\int_a^b f(x)\,dx$, we subdivide 
the interval from a to b into n subintervals, as in Fig. 4.5. The area under the curve in 
each subinterval is approximated by the trapezoid formed by replacing the curve by its 
secant line drawn between the endpoints of the curve. The integral is then approximated 
by the sum of all the trapezoidal areas. There is no necessity to make the subintervals 
equal in width, but our formula is simpler if this is done. Let h be the constant $\Delta x$. Since 
the area of a trapezoid is its average height times the base, for each subinterval, 

$$\int_{x_i}^{x_{i+1}} f(x)\,dx \approx \frac{f(x_i) + f(x_{i+1})}{2}\,\Delta x = \frac{h}{2}(f_i + f_{i+1}), \tag{4.42}$$

and for [a, b] subdivided into n subintervals of size h, 

$$\int_a^b f(x)\,dx \approx \sum_{i=1}^{n} \frac{h}{2}(f_i + f_{i+1}) = \frac{h}{2}(f_1 + f_2 + f_2 + f_3 + \cdots + f_n + f_{n+1});$$

$$\int_a^b f(x)\,dx \approx \frac{h}{2}(f_1 + 2f_2 + 2f_3 + \cdots + 2f_n + f_{n+1}). \tag{4.43}$$


Figure 4.5  [Graph of f(x) against x: the interval from x = a to x = b is divided into subintervals.]


Equation (4.42) is identical to Eq. (4.36). Equation (4.43) is called the extended 
trapezoidal rule: it lets us apply the formula over an extended region where f(x) is far 
from linear by applying the procedure to subintervals in which it can be approximated 
by linear segments. The formula is beautifully simple, and its applicability to unequally 
spaced values is useful in finding the integral of an experimentally determined function. 
It is obvious from Fig. 4.5 that the method is subject to large errors unless the subintervals 
are small, for replacing a curve by a straight line is hardly accurate. 


EXAMPLE 

Suppose we wished to integrate the function tabulated in Table 4.3 over the interval from 
x = 1.8 to x = 3.4. The extended trapezoidal rule gives 

$$\begin{aligned}
\int_{1.8}^{3.4} f(x)\,dx \approx \frac{0.2}{2}[&6.050 + 2(7.389) + 2(9.025) + 2(11.023) + 2(13.464) \\
&+ 2(16.445) + 2(20.086) + 2(24.533) + 29.964] = 23.9944.
\end{aligned}$$

The data in Table 4.3 are for $f(x) = e^x$, so the true value of the integral is 
$e^{3.4} - e^{1.8} = 23.9144$. We are off in the second decimal. 
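
The extended trapezoidal rule takes only a few lines of code. The sketch below is an illustration in Python; the function name `trapezoid_xy` is an assumption, and the function values are those that appear in the example computation above for x = 1.8 to 3.4. It sums the individual trapezoids, so it also accepts unequally spaced abscissas.

```python
def trapezoid_xy(xs, ys):
    """Extended trapezoidal rule for tabulated points (xs need not be equispaced)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two matching (x, y) points")
    total = 0.0
    for i in range(len(xs) - 1):
        # Area of one trapezoid: base times average height.
        total += (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
    return total

xs = [1.8 + 0.2 * i for i in range(9)]
ys = [6.050, 7.389, 9.025, 11.023, 13.464, 16.445, 20.086, 24.533, 29.964]

integral = trapezoid_xy(xs, ys)
print(round(integral, 4))
```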

We have previously derived the error of the trapezoidal rule as Eq. (4.37). We repeat 
it here: 

$$\text{Local error of trapezoidal rule} = -\frac{1}{12} h^3 f''(\xi_1), \qquad x_0 < \xi_1 < x_1.$$

This error, it should be emphasized, is the error of only a single step, and is hence 
called the local error. We normally apply the trapezoidal formula to a series of subintervals 
to get the integral over a large interval from x = a to x = b. We are interested in the 
total error, which is called the global error. 

To develop the formula for global error of the trapezoidal rule, we note that it is the 
sum of the local errors: 

$$\text{Global error} = -\frac{1}{12} h^3 \left[f''(\xi_1) + f''(\xi_2) + \cdots + f''(\xi_n)\right]. \tag{4.44}$$

In Eq. (4.44) each of the values of $\xi_i$ is found in one of the n successive subintervals. If we 
assume that f''(x) is continuous on (a, b), there is some value of x in (a, b), say $x = \xi$, 
at which the value of the sum in Eq. (4.44) is equal to $n f''(\xi)$. Since nh = b - a, the 
global error becomes 

$$\text{Global error} = -\frac{1}{12} h^3\, n f''(\xi) = -\frac{(b - a)}{12} h^2 f''(\xi) = O(h^2). \tag{4.45}$$

The fact that the global error is $O(h^2)$ while the local error is $O(h^3)$ is reasonable since, 
for example, if h is halved the number of subintervals is doubled, so we add together 
twice as many errors. 

When the function f(x) is known, Eq. (4.45) permits us to estimate the error of 
numerical integration by the trapezoidal rule. In applying this equation we bracket the 
error by calculating with the maximum and the minimum values of f''(x) on the interval 
[a, b]. 

For the example above, our error expression gives these estimates: 

$$\text{Error} = -\frac{1}{12} h^3\, n f''(\xi), \qquad 1.8 \le \xi \le 3.4,$$

$$= -\frac{1}{12}(0.2)^3 (8) \begin{Bmatrix} e^{1.8} & (\min) \\ e^{3.4} & (\max) \end{Bmatrix} = \begin{Bmatrix} -0.0323 & (\min) \\ -0.1598 & (\max) \end{Bmatrix}.$$

Alternatively, 

$$\text{Error} = -\frac{(3.4 - 1.8)}{12}(0.2)^2 \begin{Bmatrix} e^{1.8} & (\min) \\ e^{3.4} & (\max) \end{Bmatrix} = \begin{Bmatrix} -0.0323 \\ -0.1598 \end{Bmatrix}.$$

The actual error was -0.0800. 

If we did not know the form of the function for which we have tabulated values, we 
would estimate $h^2 f''(\xi)$ from the second differences. 
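
Bracketing the global error is a one-line calculation once the extreme values of the second derivative are known. The following Python sketch (the function name is illustrative, not from the text) evaluates the global error expression for the trapezoidal rule at the minimum and maximum of $f''(x) = e^x$ on [1.8, 3.4] and checks that the actual error of the example falls between the two bounds.

```python
import math

def trapezoid_error_bounds(a, b, h, d2_min, d2_max):
    """Global trapezoidal error -(b - a)/12 * h^2 * f''(xi) at the two extremes of f''."""
    coefficient = -(b - a) / 12.0 * h * h
    return coefficient * d2_min, coefficient * d2_max

smallest, largest = trapezoid_error_bounds(1.8, 3.4, 0.2,
                                           math.exp(1.8), math.exp(3.4))

# Error = true value - computed value, with 23.9944 from the example.
actual_error = (math.exp(3.4) - math.exp(1.8)) - 23.9944

print(round(smallest, 4), round(largest, 4), round(actual_error, 4))
```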
ROMBERG INTEGRATION 


There is a way to improve on the accuracy of the simple trapezoidal rule when the function 
is known at equispaced intervals, or when it is explicitly known so that it can be computed 
as desired. This method is known as Romberg integration, or as extrapolation to the 
limit. It is based on the same principle as that used in Section 4.5 for the extrapolation 
of derivative values. We illustrate the method by a reconsideration of the data in Table 
4.3. We saw that the trapezoidal rule gave a value of 23.9944 for $\int_{1.8}^{3.4} f(x)\,dx$, which 
differs from the true value of 23.9144. Purely for illustrative purposes, we do not desire 
to take advantage of our knowledge of the function, so we do not make h smaller to 
improve our estimate of the integral. 

Even if we should not make the interval smaller, we can certainly make it larger, 
by using only every other value, taking h = 0.4. If we do this, we get: 


$$I(h = 0.4) = \frac{0.4}{2}[6.050 + 2(9.025) + 2(13.464) + 2(20.086) + 29.964] = 24.2328.$$


This value is even more in error than before, which, of course, we expect with h 
larger; but we can take advantage of this to nearly eliminate the error of our former 
computation if we assume that errors of $O(h^2)$ can be interpreted as proportional to $h^2$, 
say error = $Ch^2$. Taking C to be a constant applies strictly only in the limit, of course, 
and at h = 0.2 and h = 0.4 we are not very near zero. It can be shown,* however, that 
we make an error of only $O(h^4)$ by assuming that the errors are proportional to $h^2$. We 
then have, for the two computations above, 


True value = computed value + error, 

$$\begin{aligned}
\text{True value} &= 23.9944 + C(0.2)^2, \\
\text{True value} &= 24.2328 + C(0.4)^2.
\end{aligned} \tag{4.46}$$

In Eq. (4.46) there are two unknowns, the “true value” and the proportionality constant 
C, but we can certainly solve for these with two equations. We get, subtracting the second 
equation from four times the first, 

$$\text{True value} = 23.9944 + \tfrac{1}{3}(23.9944 - 24.2328) = 23.9149 \quad \text{(versus 23.9144, exact value)}.$$

We have extrapolated from two inexact values to an improved one. (What we have 
called “true value” is not exact because it used the erroneous assumption that $O(h^2)$ was 
the same as $Ch^2$.) 


*See, for example, Davis and Rabinowitz (1967), who show that the error can be expressed as 
$C_1 h^2 + C_2 h^4 + C_3 h^6 + \cdots$. 
We can make a further improvement by extrapolation again, if we have two such 
estimates of $O(h^4)$ accuracy, by assuming that their errors can be stated as $Ch^4$. Such a 
second estimate can be obtained from the data in Table 4.3 by backing off once more, 
and computing the area with h = 0.8. If we do so we get I(h = 0.8) = 25.1768. We 
use this with I(h = 0.4) = 24.2328, as shown below. 

    h      Trapezoidal rule estimate    Extrapolated value
           O(h^2) errors                O(h^4) errors
    0.2    23.9944
                                        23.9149
    0.4    24.2328
                                        23.9181
    0.8    25.1768

As before, we set up relationships: 

$$\begin{aligned}
\text{True value} &= 23.9149 + C(0.2)^4, \\
\text{True value} &= 23.9181 + C(0.4)^4.
\end{aligned}$$

Solving, we get 

$$\text{True value} = 23.9149 + \tfrac{1}{15}(23.9149 - 23.9181) = 23.9147.$$


This is now as close as can be determined to the exact value, 23.9144. One further 
order of extrapolation could be made, for we could start with h = 1.6 and, by combining 
with h = 0.8 and h = 0.4, get another second-order extrapolation. These second-order 
extrapolations can be shown to be of $O(h^6)$ accuracy and can be combined to an estimate 
of $O(h^8)$. No further improvement results from this, however, due to the influence of 
round-off errors in the original data. With only three-decimal accuracy there, it is not 
surprising to find errors in the fourth decimal of the integrals. 

The number of successive improvements that can be made, and hence the order of 
the error that can be attained, obviously depends on the number of functional values that 
are available, or that are computed when f(x) is a known function. The number of 
successive improvements that should be made may be less than the number that can, in 
principle, be made, because of the effects of round-off error. 

We summarize our extrapolation rule, to be used for improving two values for which 
h varies 2:1 (it is identical to Eq. (4.27), which was derived for derivatives): 

$$\text{Improved value} = \text{more accurate} + \frac{1}{2^n - 1}(\text{more accurate} - \text{less accurate}). \tag{4.47}$$

The exponent n in the coefficient is the exponent on h in the error term $O(h^n)$ that 
applies to the two values used for extrapolation. One always knows that the more accurate 
value is the one computed with smaller h. 

Actually, one hardly ever improves the estimate of an integral by comparing results 
from a small h to those with a larger h; instead the reverse is done. The previous discussion 
is primarily motivational and to give a basis for the following. 

When this technique of extrapolation is applied to integrals of known functions so 
that values can be computed for smaller values of h, it is known as Romberg integration. 
It is quite popular in computer programs,* often being available as a stock subroutine in 
the computer library. The technique is to compute the integral for an arbitrarily chosen 
h, and then to compute it again with h halved. If the two values for the integral differ 
by more than some tolerance value, the value is improved by extrapolation, and a second 
integration is computed with h halved again. This is combined with the previous value 
of the integral to give a second extrapolated value that is compared to the first. If necessary, 
these two are combined to give a higher-order extrapolation. The method is continued 
until a pair of extrapolated values of the integral agree satisfactorily. 

The Romberg method is applicable to a wide class of functions; in fact, to any 
Riemann-integrable function. Smoothness and continuity are not required. 
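
The halving-and-extrapolating procedure described above can be sketched as follows. This is a minimal illustration in Python, not the subroutine given at the end of the chapter; the name `romberg`, the tolerance, and the limit on the number of halvings are assumptions. Each new row reuses the previous trapezoidal sum, adds the new midpoints, and then applies the extrapolation rule of Eq. (4.47) with n = 2, 4, 6, ....

```python
import math

def romberg(f, a, b, tol=1e-10, max_levels=20):
    """Romberg integration: returns (estimate, triangular table of values)."""
    h = b - a
    table = [[0.5 * h * (f(a) + f(b))]]          # trapezoidal rule, one panel
    for k in range(1, max_levels):
        h /= 2.0
        # Only the new (odd-numbered) points need to be evaluated.
        new_points = sum(f(a + (2 * i + 1) * h) for i in range(2 ** (k - 1)))
        row = [0.5 * table[k - 1][0] + h * new_points]
        for j in range(1, k + 1):
            # Errors in column j-1 are O(h^(2j)), so the factor is 1/(2^(2j) - 1).
            row.append(row[j - 1] + (row[j - 1] - table[k - 1][j - 1]) / (4 ** j - 1))
        table.append(row)
        if abs(row[-1] - table[k - 1][-1]) < tol:
            return row[-1], table
    return table[-1][-1], table

value, table = romberg(math.exp, 1.8, 3.4)
print(value, math.exp(3.4) - math.exp(1.8))
```

The stopping test compares the highest-order values of two successive rows; other criteria are possible, and round-off limits how many extrapolations are worthwhile.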


SIMPSON’S RULES 


The Newton-Cotes formulas based on quadratic and cubic polynomials are also widely 
used. In their composite forms they are known as Simpson's rules. There are two of 
these. 

The second-degree Newton-Cotes formula integrates a quadratic over two $\Delta x$ intervals 
that are of uniform width. (Such intervals are often called panels.) We repeat the following 
equation from Section 4.7: 

$$\int_{x_0}^{x_2} f(x)\,dx = \frac{h}{3}(f_0 + 4f_1 + f_2) - \frac{h^5}{90} f^{iv}(\xi), \qquad x_0 \le \xi \le x_2. \tag{4.48}$$

This is the very popular Simpson's 1/3 rule, which has a local error of $O(h^5)$. If we 
apply this to a succession of pairs of panels to evaluate $\int_a^b f(x)\,dx$, we get 

$$\begin{aligned}
\int_a^b f(x)\,dx = {}& \frac{h}{3}(f_1 + 4f_2 + 2f_3 + 4f_4 + 2f_5 + \cdots + 2f_{n-1} + 4f_n + f_{n+1}) \\
& - \frac{(b - a)}{180} h^4 f^{iv}(\xi), \qquad x_1 \le \xi \le x_{n+1}.
\end{aligned} \tag{4.49}$$


*A subroutine that performs Romberg integration is given at the end of this chapter. 
In Eq. (4.49) we have subscripted the x-values so that $x_1 = a$ and $x_{n+1} = b$. In 
getting the global error we have used the fact that $\sum f^{iv}(\xi_i) = n f^{iv}(\xi)$, which imposes a 
continuity condition on the fourth derivative. Here n, the number of times that the local 
rule was used, is only one-half the number of $\Delta x$ steps, because we integrate over two 
panels each time; this causes the denominator to change to 180 from 90. Note that 
Simpson's 1/3 rule has a global error of $O(h^4)$. Its use requires that we subdivide into an 
even number of panels. 


EXAMPLE 

Apply Simpson's 1/3 rule to the data of Table 4.3 to evaluate $\int_{1.8}^{3.4} f(x)\,dx$. 
Using h = 0.2, we get 

$$\begin{aligned}
\int_{1.8}^{3.4} f(x)\,dx \approx \frac{0.2}{3}[&6.050 + 4(7.389) + 2(9.025) + 4(11.023) \\
&+ 2(13.464) + 4(16.445) + 2(20.086) \\
&+ 4(24.533) + 29.964] = 23.9149.
\end{aligned}$$

It is no coincidence that this value is the same as the extrapolated value from the 
trapezoidal-rule estimates at h = 0.2 and h = 0.4. The student should demonstrate that 
these two procedures always give identical results. 

$$\text{Error} = -\frac{(3.4 - 1.8)}{180}(0.2)^4 \begin{Bmatrix} e^{1.8} & (\min) \\ e^{3.4} & (\max) \end{Bmatrix} = \begin{Bmatrix} -8.6 \times 10^{-5} \\ -4.3 \times 10^{-4} \end{Bmatrix}.$$

The actual error is $-5 \times 10^{-4}$; round-off has magnified it. 


An extrapolation can also be made for this example. Using h = 0.4, we get 

$$\int_{1.8}^{3.4} f(x)\,dx \approx \frac{0.4}{3}[6.050 + 4(9.025) + 2(13.464) + 4(20.086) + 29.964] = 23.9181.$$

Similarly, with h = 0.8, we get 23.9653. Our results can be summarized in a table as 
we did in the previous section: 


    h      Simpson's rule estimate    Extrapolated value
           O(h^4) errors              O(h^6) errors
    0.2    23.9149
                                      23.9147
    0.4    23.9181
                                      23.9150
    0.8    23.9653

The values in the last column are obtained by the extrapolations: 

$$\text{True value} = 23.9149 + \tfrac{1}{15}(23.9149 - 23.9181) = 23.9147,$$

$$\text{True value} = 23.9181 + \tfrac{1}{15}(23.9181 - 23.9653) = 23.9150.$$

A further extrapolation of these last two values gives 

$$\text{True value} = 23.9147 + \tfrac{1}{63}(23.9147 - 23.9150) = 23.9147.$$


No improvement is obtained due to round-off errors and the short precision used in Table 
4.3. Had we used 10 digits of precision throughout, this last result would have been 

$$\text{True value} = 23.914456 + \tfrac{1}{63}(23.914456 - 23.914644) = 23.914453.$$

$$\text{Exact value} = e^{3.4} - e^{1.8} = 23.914453.$$

Since the error term for Simpson's rule can be written as $C_1 h^4 + C_2 h^6 + C_3 h^8 + \cdots$, 
we can do a Romberg-like set of successive extrapolations, as we have shown in this 
example. 
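
The identity between Simpson's 1/3 rule and one extrapolation of trapezoidal estimates is easy to confirm numerically. The following Python sketch uses the tabulated values that appear in the worked computations for x = 1.8 to 3.4; the function names are illustrative, not from the text.

```python
def trapezoid(values, h):
    """Extended trapezoidal rule for equispaced function values."""
    return h * (0.5 * values[0] + sum(values[1:-1]) + 0.5 * values[-1])

def simpson13(values, h):
    """Composite Simpson's 1/3 rule; needs an even number of panels."""
    panels = len(values) - 1
    if panels < 2 or panels % 2 != 0:
        raise ValueError("Simpson's 1/3 rule needs an even number of panels")
    return h / 3.0 * (values[0] + 4.0 * sum(values[1:-1:2])
                      + 2.0 * sum(values[2:-1:2]) + values[-1])

values = [6.050, 7.389, 9.025, 11.023, 13.464, 16.445, 20.086, 24.533, 29.964]

simpson_fine = simpson13(values, 0.2)
trap_fine = trapezoid(values, 0.2)
trap_coarse = trapezoid(values[::2], 0.4)        # every other value, h = 0.4
extrapolated = trap_fine + (trap_fine - trap_coarse) / 3.0

print(round(simpson_fine, 4), round(extrapolated, 4))
```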
The third Newton-Cotes formula that finds frequent use is obtained by integrating 
a cubic interpolating polynomial over its range of fit. We have already derived it as Eq. 
(4.40) and the error term as Eq. (4.41). The local rule is 

$$\int_{x_0}^{x_3} f(x)\,dx = \frac{3h}{8}(f_0 + 3f_1 + 3f_2 + f_3) - \frac{3}{80} h^5 f^{iv}(\xi), \qquad x_0 \le \xi \le x_3.$$

The value of the coefficient 3h/8 gives it its common name, Simpson's 3/8 rule. When 
applied over the interval from $x_1 = a$ to $x_{n+1} = b$ (the interval must be divided into a 
multiple of three panels), we get 

$$\begin{aligned}
\int_a^b f(x)\,dx = {}& \frac{3h}{8}(f_1 + 3f_2 + 3f_3 + 2f_4 + 3f_5 + 3f_6 + 2f_7 + \cdots + 2f_{n-2} \\
& + 3f_{n-1} + 3f_n + f_{n+1}) - \frac{(b - a)}{80} h^4 f^{iv}(\xi), \qquad x_1 \le \xi \le x_{n+1}.
\end{aligned} \tag{4.50}$$

The global rule has a coefficient in the error term of 1/80, reduced from 3/80 in the local error, 
because we apply the local rule n/3 times when there are n panels. 
As we have previously seen, the order of the global error, $O(h^4)$, is no higher than 
for Simpson's 1/3 rule and the coefficient is larger. One might wonder whether the 3/8 rule 
should ever be used. 

As we have seen, Simpson's 1/3 rule requires that the interval be subdivided into an 
even number of panels. When applied to tabulated data, this is often impossible. In such 
cases, one can combine the two rules, applying the 1/3 rule over most of the interval and 
completing the integration by picking up three panels with the 3/8 rule. An alternative 
method for an odd number of panels is to apply the trapezoidal rule to one panel. This involves 
a penalty of an $O(h^3)$ error over that subinterval, however, and is somewhat less accurate. 


EXAMPLE 

Use the data of Table 4.3 to find the integral from x = 1.6 to x = 3.4. There are nine 
panels, so we cannot apply the 1/3 rule directly. We have several choices: we can apply 
the trapezoidal rule, use Simpson's 3/8 rule, or combine Simpson's 1/3 
rule with either one of the others. We investigate each choice. 

With the trapezoidal rule: 

$$\int_{1.6}^{3.4} f(x)\,dx = 25.0947, \qquad \text{error} = -0.0836.$$

With Simpson's 3/8 rule: 

$$\int_{1.6}^{3.4} f(x)\,dx = 25.0119, \qquad \text{error} = -0.0008.$$

With Simpson's 1/3 rule from 1.8 to 3.4, and the trapezoidal rule from 1.6 to 1.8: 

$$\int_{1.6}^{3.4} f(x)\,dx = 25.0152, \qquad \text{error} = -0.0041.$$

With Simpson's 1/3 rule from 2.2 to 3.4, and the 3/8 rule from 1.6 to 2.2: 

$$\int_{1.6}^{3.4} f(x)\,dx = 25.0115, \qquad \text{error} = -0.0004.$$

The results are as we would anticipate: The best accuracy is obtained when the 1/3 and 
3/8 rules are combined. We still have a choice to make when we combine the rules: at 
which three panels should we use the 3/8 rule? A consideration of the error term of the 3/8 
rule gives the answer: apply it where its error is smallest. This means that we should choose 
the three panels where the fourth derivative of the function is smallest. For the data of 
this example, we know the function [$f(x) = e^x$], so the choice we made is best. The 
choice will not always be obvious when we know f(x) only as the table of values. The 
differences will be minor, however, and round-off errors distort the picture, so this point 
is of more theoretical than practical importance. 
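
The choices in this example can be compared with a short program. The Python sketch below is illustrative only: the function names are assumptions, and the first table entry, 4.953 at x = 1.6, is assumed here to be $e^{1.6}$ rounded to three decimals, since the table itself is not reproduced in this section. Because the data carry only three decimals, the last digit of each result is sensitive to rounding.

```python
import math

def trapezoid(values, h):
    """Extended trapezoidal rule for equispaced values."""
    return h * (0.5 * values[0] + sum(values[1:-1]) + 0.5 * values[-1])

def simpson13(values, h):
    """Composite Simpson's 1/3 rule (even number of panels)."""
    if (len(values) - 1) % 2 != 0:
        raise ValueError("need an even number of panels")
    return h / 3.0 * (values[0] + 4.0 * sum(values[1:-1:2])
                      + 2.0 * sum(values[2:-1:2]) + values[-1])

def simpson38(values, h):
    """Composite Simpson's 3/8 rule (number of panels divisible by 3)."""
    panels = len(values) - 1
    if panels % 3 != 0:
        raise ValueError("need a multiple of three panels")
    total = values[0] + values[-1]
    for i in range(1, panels):
        total += (2.0 if i % 3 == 0 else 3.0) * values[i]
    return 3.0 * h / 8.0 * total

h = 0.2
# x = 1.6, 1.8, ..., 3.4; the first entry is an assumed table value (see above).
values = [4.953, 6.050, 7.389, 9.025, 11.023, 13.464,
          16.445, 20.086, 24.533, 29.964]
exact = math.exp(3.4) - math.exp(1.6)

results = {
    "trapezoidal": trapezoid(values, h),
    "3/8 rule": simpson38(values, h),
    "trapezoidal + 1/3": trapezoid(values[:2], h) + simpson13(values[1:], h),
    "3/8 + 1/3": simpson38(values[:4], h) + simpson13(values[3:], h),
}
for name, estimate in results.items():
    print(f"{name:18s} {estimate:.4f}   error = {exact - estimate:+.4f}")
```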

For convenience, we collect here the Newton-Cotes integration formulas. In a later 
section we will consider adaptive integration methods using these formulas. In addition, 
we will consider ways to find the step size, h, so that accuracy is obtained without unduly 
increasing the number of function evaluations. 
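Formulas for integration (Uniform spacing, $\Delta x = h$) 

Trapezoidal Rule: 

$$\int_a^b f(x)\,dx = \frac{h}{2}(f_1 + 2f_2 + 2f_3 + \cdots + 2f_n + f_{n+1}) - \frac{(b - a)}{12} h^2 f''(\xi), \qquad a \le \xi \le b.$$

Simpson's 1/3 Rule: 

$$\int_a^b f(x)\,dx = \frac{h}{3}(f_1 + 4f_2 + 2f_3 + 4f_4 + \cdots + 2f_{n-1} + 4f_n + f_{n+1}) - \frac{(b - a)}{180} h^4 f^{iv}(\xi), \qquad a \le \xi \le b.$$

(Requires an even number of panels.) 

Simpson's 3/8 Rule: 

$$\int_a^b f(x)\,dx = \frac{3h}{8}(f_1 + 3f_2 + 3f_3 + 2f_4 + 3f_5 + \cdots + 2f_{n-2} + 3f_{n-1} + 3f_n + f_{n+1}) - \frac{(b - a)}{80} h^4 f^{iv}(\xi), \qquad a \le \xi \le b.$$

(Requires a number of panels divisible by 3.) 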


OTHER WAYS TO DERIVE INTEGRATION FORMULAS 


It is interesting to examine other ways of deriving integration formulas than by integrating 
the interpolating polynomial. One way is to use symbolic methods, similar to those of 
Section 4.3 and Section 3.7. 

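In terms of the stepping operator E, 

$$f(x_s) = f_s = E^s f_0.$$

Multiplying by dx = h ds and integrating from $x_0$ to $x_1$ (s = 0 to s = 1), we obtain 

$$\int_{x_0}^{x_1} f(x)\,dx = h\int_0^1 E^s f_0\,ds = h\left[\frac{E^s}{\ln E}\right]_0^1 f_0 = \frac{h(E - 1)}{\ln E} f_0.$$

Let $E = 1 + \Delta$ and expand $\ln(1 + \Delta)$ as a power series: 

$$\ln(1 + \Delta) = \Delta - \frac{\Delta^2}{2} + \frac{\Delta^3}{3} - \frac{\Delta^4}{4} + \cdots.$$

On dividing $\Delta$ by this series, we get 

$$\begin{aligned}
\int_{x_0}^{x_1} f(x)\,dx &= h\,\frac{\Delta}{\Delta - \frac{1}{2}\Delta^2 + \frac{1}{3}\Delta^3 - \frac{1}{4}\Delta^4 + \cdots}\, f_0 \\
&= h\left(1 + \frac{1}{2}\Delta - \frac{1}{12}\Delta^2 + \frac{1}{24}\Delta^3 - \cdots\right) f_0 \\
&= h\left(f_0 + \frac{1}{2}\Delta f_0 - \frac{1}{12}\Delta^2 f_0 + \frac{1}{24}\Delta^3 f_0 - \cdots\right).
\end{aligned} \tag{4.51}$$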


The coefficients are considerably easier to get by this technique than by the term- 
by-term integration of Section 4.7. 

Equation (4.51) is not a Newton-Cotes formula unless we use only the first two 
terms, making the interval of integration and the range of fit of the polynomial agree. 
When terms through $\Delta^n f_0$ are used, it represents a polynomial of degree n, fitting from $x_0$ to $x_n$ but 
integrated only from $x_0$ to $x_1$. This is especially useful in connection with 
differential-equation methods. 

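We can develop the formula for an nth-degree interpolating polynomial integrated 
over two panels, from $x_0$ to $x_2$, in a similar fashion: 

$$\int_{x_0}^{x_2} f(x)\,dx = h\left[\frac{E^s}{\ln E}\right]_0^2 f_0 = \frac{h(E^2 - 1)}{\ln E} f_0 = \frac{h(E + 1)(E - 1)}{\ln E} f_0.$$

Again letting $E = 1 + \Delta$ so $E - 1 = \Delta$ and $E + 1 = 2 + \Delta$, and dividing $\Delta$ by 
the series for $\ln(1 + \Delta)$, we get 

$$\begin{aligned}
\int_{x_0}^{x_2} f(x)\,dx &= h(2 + \Delta)\left(f_0 + \frac{1}{2}\Delta f_0 - \frac{1}{12}\Delta^2 f_0 + \frac{1}{24}\Delta^3 f_0 - \cdots\right) \\
&= h\Big(2f_0 + \Delta f_0 - \frac{1}{6}\Delta^2 f_0 + \frac{1}{12}\Delta^3 f_0 - \cdots \\
&\qquad + \Delta f_0 + \frac{1}{2}\Delta^2 f_0 - \frac{1}{12}\Delta^3 f_0 + \frac{1}{24}\Delta^4 f_0 - \cdots\Big) \\
&= h\left(2f_0 + 2\Delta f_0 + \frac{1}{3}\Delta^2 f_0 + 0 \cdot \Delta^3 f_0 - \cdots\right).
\end{aligned} \tag{4.52}$$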


The method can obviously be extended. One reason why we might desire formulas 
such as Eqs. (4.51) and (4.52) is to use them to construct lozenge diagrams similar to 
Figs. 4.1 and 4.3. These would permit integration over m panels based on polynomials 
of degree n. The diagrams are easy to construct if we remember that the coefficients 
themselves form a difference table that is interlaced with the function differences. Since 
they are of rather special application, we do not exhibit them here. 
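
The coefficients in Eqs. (4.51) and (4.52) come from dividing $\Delta$ by the series for $\ln(1 + \Delta)$, which can be carried out exactly with rational arithmetic. The following Python sketch is an illustration only; the helper name `reciprocal_series` is an assumption. It inverts the series $\ln(1 + \Delta)/\Delta = 1 - \Delta/2 + \Delta^2/3 - \cdots$ term by term and then multiplies by $(2 + \Delta)$ for the two-panel case.

```python
from fractions import Fraction

def reciprocal_series(c, n_terms):
    """Coefficients r of 1 / (c[0] + c[1] x + c[2] x^2 + ...), with c[0] != 0."""
    r = [Fraction(1) / c[0]]
    for k in range(1, n_terms):
        # The coefficient of x^k in (series c) * (series r) must vanish.
        partial = sum(c[j] * r[k - j] for j in range(1, k + 1))
        r.append(-partial / c[0])
    return r

n_terms = 5
# ln(1 + x)/x = 1 - x/2 + x^2/3 - x^3/4 + ...
log_over_x = [Fraction((-1) ** k, k + 1) for k in range(n_terms)]

one_panel = reciprocal_series(log_over_x, n_terms)
# Multiply by (2 + x): new coefficient k is 2*r[k] + r[k-1].
two_panel = [2 * one_panel[0]] + [2 * one_panel[k] + one_panel[k - 1]
                                  for k in range(1, n_terms)]

print([str(c) for c in one_panel])
print([str(c) for c in two_panel])
```

The first list reproduces the coefficients 1, 1/2, -1/12, 1/24 of Eq. (4.51); the second shows the vanishing third-difference coefficient of Eq. (4.52).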

We now present still another interesting method of deriving formulas that can be 
applied to a variety of situations including the development of integration formulas. It 
may be called the method of undetermined coefficients.* We express the formula as a 
sum of n + 1 terms with unknown coefficients, and then evaluate the coefficients by 
requiring that the formula be exact for all polynomials of degree n or less. We illustrate 
it here by finding Simpson's 1/3 rule by this other technique. Express the integral as a 
weighted sum of three equispaced function values: 

$$\int_{-1}^{1} f(x)\,dx = a f(-1) + b f(0) + c f(+1). \tag{4.53}$$

*This method is described in more detail in the Appendix, and several examples are presented. 

The symmetrical interval of integration simplifies the arithmetic. We stipulate that 
the function is to be evaluated at three equally spaced points, the two end values and 
the midpoint. Since the formula contains three terms, we can require it to be correct for 
all polynomials of degree 2 or less. If that is true, it certainly must be true for the three 
special cases of $f(x) = x^2$, $f(x) = x$, and $f(x) = 1$. We rewrite Eq. (4.53) three times, 
applying each definition of f(x) in turn: 

$$f(x) = 1: \qquad \int_{-1}^{1} dx = 2 = a(1) + b(1) + c(1) = a + b + c;$$

$$f(x) = x: \qquad \int_{-1}^{1} x\,dx = 0 = a(-1) + b(0) + c(1) = -a + c;$$

$$f(x) = x^2: \qquad \int_{-1}^{1} x^2\,dx = \frac{2}{3} = a(1) + b(0) + c(1) = a + c.$$

Solving the three equations simultaneously gives $a = \frac{1}{3}$, $b = \frac{4}{3}$, $c = \frac{1}{3}$. Here the spacing 
between points was unity; obviously the integral is proportional to $\Delta x = h$. We then get 
Simpson's 1/3 rule: 

$$\int_{-h}^{h} f(x)\,dx = h\left[\frac{1}{3} f(-h) + \frac{4}{3} f(0) + \frac{1}{3} f(h)\right].$$
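
The method of undetermined coefficients amounts to solving a small linear system, which can be done exactly. The Python sketch below is illustrative (the helper names are assumptions, not from the text): it requires the formula to be exact for $1, x, x^2, \ldots$ on the interval from -1 to 1 for a given set of nodes, and solves for the weights with rational Gauss-Jordan elimination.

```python
from fractions import Fraction

def solve_linear(A, b):
    """Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def weights_for_nodes(nodes):
    """Weights making sum(w_i f(x_i)) exact for 1, x, ..., x^(n-1) on [-1, 1]."""
    n = len(nodes)
    A = [[Fraction(x) ** k for x in nodes] for k in range(n)]
    # Integral of x^k over [-1, 1]: 2/(k + 1) for even k, 0 for odd k.
    b = [Fraction(2, k + 1) if k % 2 == 0 else Fraction(0) for k in range(n)]
    return solve_linear(A, b)

weights = weights_for_nodes([-1, 0, 1])
print([str(w) for w in weights])
```

The same routine gives the trapezoidal weights for the nodes -1 and 1, and other node sets can be tried in the same way.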


GAUSSIAN QUADRATURE 


Our previous formulas for numerical integration were all predicated on evenly spaced x- 
values; this means the x-values were predetermined. With a formula of three terms, then, 
there were three parameters, the coefficients (weighting factors) applied to each of the 
functional values. A formula with three parameters corresponds to a polynomial of the 
second degree, one less than the number of parameters. Gauss observed that if we remove 
the requirement that the function be evaluated at predetermined x-values, a three-term 
formula will contain six parameters (the three x-values are now unknowns, plus the three 
weights) and should correspond to an interpolating polynomial of degree 5. Formulas 
based on this principle are called Gaussian quadrature formulas. They can be applied 
only when f(x) is explicitly known, so that it can be evaluated at any desired value 
of x. 

We will determine the parameters in the simple case of a two-term formula containing 
four unknown parameters: 


$$\int_{-1}^{1} f(t)\,dt = a f(t_1) + b f(t_2).$$


The method is the same as that illustrated in the previous section, by determining unknown
parameters. We use a symmetrical interval of integration to simplify the arithmetic, and
call our variable t. (This agrees with the notation of most authors. Since the variable of
integration is only a dummy variable, its name is unimportant.) Our formula is to be
valid for any polynomial of degree 3; hence it will hold if $f(t) = t^3$, $f(t) = t^2$, $f(t) = t$,
and $f(t) = 1$:

$$f(t) = t^3: \quad \int_{-1}^{1} t^3\,dt = 0 = a t_1^3 + b t_2^3;$$

$$f(t) = t^2: \quad \int_{-1}^{1} t^2\,dt = \frac{2}{3} = a t_1^2 + b t_2^2;$$

$$f(t) = t: \quad \int_{-1}^{1} t\,dt = 0 = a t_1 + b t_2;$$

$$f(t) = 1: \quad \int_{-1}^{1} dt = 2 = a + b. \qquad (4.54)$$

Multiplying the third equation by $t_1^2$, and subtracting from the first, we have

$$0 = 0 + b(t_2^3 - t_2 t_1^2) = b(t_2)(t_2 - t_1)(t_2 + t_1). \qquad (4.55)$$

We can satisfy Eq. (4.55) by either $b = 0$, $t_2 = 0$, $t_1 = t_2$, or $t_1 = -t_2$. Only the
last of these possibilities is satisfactory, the others being invalid or else reducing our
formula to only a single term, so we choose $t_1 = -t_2$. We then find that

$$a = b = 1,$$

$$t_2 = -t_1 = \sqrt{\tfrac{1}{3}} = 0.5773,$$

$$\int_{-1}^{1} f(t)\,dt = f(-0.5773) + f(0.5773).$$


It is remarkable that adding these two values of the function gives the exact value 
for the integral of any cubic polynomial over the interval from —1 to 1. 
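The exactness claim can be tested directly on the monomials 1, t, t^2, t^3, which span all cubics. This is an illustrative sketch only; the names `gauss2` and `exact_monomial` are hypothetical. Degrees 4 and 5 are included to show where exactness stops.

```python
import math

T = 1.0 / math.sqrt(3.0)   # node magnitude, 0.5773...

def gauss2(f):
    """Two-term Gaussian formula on [-1, 1] with both weights equal to 1."""
    return f(-T) + f(T)

def exact_monomial(k):
    """Exact integral of t**k over [-1, 1]."""
    return 0.0 if k % 2 else 2.0 / (k + 1)

errors = []
for k in range(6):
    approx = gauss2(lambda t: t ** k)
    errors.append(abs(approx - exact_monomial(k)))
    print(k, approx, exact_monomial(k))
```

The errors for k = 0 through 3 are at round-off level, while k = 4 gives 2/9 against the exact 2/5.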

Suppose our limits of integration are from a to b, and not -1 to 1 for which we
derived this formula. To use the tabulated Gaussian quadrature parameters, we must
change the interval of integration to (-1, 1) by a change of variable. We replace the
given variable by another to which it is linearly related according to the following scheme:

If we let

$$x = \frac{(b - a)t + b + a}{2}, \quad\text{so that}\quad dx = \frac{b - a}{2}\,dt,$$

then

$$\int_a^b f(x)\,dx = \frac{b - a}{2}\int_{-1}^{1} f\!\left(\frac{(b - a)t + b + a}{2}\right)dt.$$


EXAMPLE Evaluate $I = \int_0^{\pi/2} \sin x\,dx$. (Obviously, I = 1.0, so we can readily see the error of our
estimate.)

To use the two-term Gaussian formula, we must change the variable of integration
to make the limits of integration from -1 to 1.
Let

$$x = \frac{(\pi/2)t + \pi/2}{2}, \quad\text{so}\quad dx = (\pi/4)\,dt.$$

Observe that when t = -1, x = 0; when t = 1, x = $\pi/2$. Then

$$I = \frac{\pi}{4}\int_{-1}^{1} \sin\!\left(\frac{\pi t + \pi}{4}\right)dt.$$

The Gaussian formula calculates the value of the new integral as a weighted sum of
two values of the integrand, at t = -0.5773 and at t = 0.5773. Hence,

$$I = \frac{\pi}{4}\left[(1.0)\sin(0.10566\pi) + (1.0)\sin(0.39434\pi)\right] = 0.99847.$$

The error is $1.53 \times 10^{-3}$.


The power of the Gaussian method derives from the fact that we need only two 
functional evaluations. If we had used the trapezoidal rule, which also requires only 
two evaluations, our estimate would have been (7/4)(0.0 + 1.0) = 0.7854, an answer 
quite far from the mark. Simpson's 1/3 rule requires three functional evaluations and
gives I = 1.0023, with an error of $-2.3 \times 10^{-3}$, somewhat greater than for Gaussian
quadrature.
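The three estimates compared above can be reproduced in a few lines. This is a sketch under the stated change of variable; the variable names are illustrative and not from the original text.

```python
import math

def f(x):
    return math.sin(x)

a, b = 0.0, math.pi / 2
half = (b - a) / 2          # scale factor (b - a)/2 from the change of variable
mid = (b + a) / 2           # image of t = 0
t = 1 / math.sqrt(3)        # Gaussian node, 0.5773...

gauss = half * (f(mid - half * t) + f(mid + half * t))   # two evaluations
trap = (b - a) / 2 * (f(a) + f(b))                       # two evaluations
simpson = (b - a) / 6 * (f(a) + 4 * f(mid) + f(b))       # three evaluations

print(round(gauss, 5), round(trap, 4), round(simpson, 4))
```

With the exact value 1.0, the printed estimates show the two-point Gaussian result lying closest, in line with the comparison in the text.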

Gaussian quadrature can be extended beyond two terms. The formula is then given 
by 


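$$\int_{-1}^{1} f(t)\,dt = \sum_{i=1}^{n} w_i f(t_i), \quad\text{for } n \text{ points}. \qquad (4.56)$$

This formula is exact for functions f(t) that are polynomials of degree 2n - 1 or less!
Moreover, by extending the method we used previously for the 2-point formula, for each
n we obtain a system of 2n equations:

$$w_1 t_1^k + w_2 t_2^k + \cdots + w_n t_n^k =
\begin{cases}
0, & \text{for } k = 1, 3, 5, \ldots, 2n - 1; \\
\dfrac{2}{k + 1}, & \text{for } k = 0, 2, 4, \ldots, 2n - 2.
\end{cases}$$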


This approach is obvious. However, the set of equations we get by writing f(t) as a
succession of polynomials is not easily solved. We wish to indicate an approach that is
easier than the methods for a nonlinear system that we used in Chapter 2.




It turns out that the $t_i$'s for a given n are the roots of the nth-degree Legendre
polynomial. The Legendre polynomials are defined by recursion:

$$(n + 1)L_{n+1}(x) - (2n + 1)x L_n(x) + n L_{n-1}(x) = 0,$$

with $L_0(x) = 1$, $L_1(x) = x$.

Then $L_2(x)$ is

$$L_2(x) = \frac{3x L_1(x) - (1)L_0(x)}{2} = \frac{3}{2}x^2 - \frac{1}{2};$$

its roots are $\pm\sqrt{1/3} = \pm 0.5773$, precisely the t-values for the two-term formula.
By using the recursion relation, we find

$$L_3(x) = \frac{5x^3 - 3x}{2},$$

$$L_4(x) = \frac{35x^4 - 30x^2 + 3}{8}.$$


The methods of Chapter 1 allow us to find the roots of these polynomials. After
these have been determined, the set of equations analogous to Eqs. (4.54) can easily be 
solved for the weighting factors because the equations are linear with respect to these 
unknowns. 

Table 4.4 lists the zeros of Legendre polynomials up to degree 5, giving values that
we need for Gaussian quadrature where the basic polynomial is up to degree 9. For
example, $L_3(x)$ has zeros at x = 0, +0.77459667, and -0.77459667.
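The two-stage procedure described above (roots of the Legendre polynomial first, then a linear system for the weights) can be sketched as follows. This is an illustration under assumptions, not the book's program: the polynomial is built from the recursion, the roots are found by Newton's method from a commonly used cosine starting guess, and the weights come from the first n moment equations. All function names are hypothetical.

```python
import math

def legendre_coeffs(n):
    """Coefficients (ascending powers) of L_n from the three-term recursion."""
    prev, cur = [1.0], [0.0, 1.0]            # L_0 = 1, L_1 = x
    if n == 0:
        return prev
    for m in range(1, n):
        # (m + 1) L_{m+1} = (2m + 1) x L_m - m L_{m-1}
        nxt = [0.0] + [(2 * m + 1) * c for c in cur]
        for i, c in enumerate(prev):
            nxt[i] -= m * c
        prev, cur = cur, [c / (m + 1) for c in nxt]
    return cur

def horner(coeffs, x):
    """Return the polynomial value and its derivative at x."""
    p, dp = 0.0, 0.0
    for c in reversed(coeffs):
        dp = dp * x + p
        p = p * x + c
    return p, dp

def legendre_roots(n):
    coeffs = legendre_coeffs(n)
    roots = []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))   # starting guess
        for _ in range(50):                               # Newton's method
            p, dp = horner(coeffs, x)
            x -= p / dp
        roots.append(x)
    return sorted(roots)

def solve(A, b):
    """Gaussian elimination with partial pivoting and back substitution."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gauss_weights(nodes):
    """Weights from the moment equations for k = 0, ..., n - 1 (linear in w)."""
    n = len(nodes)
    A = [[t ** k for t in nodes] for k in range(n)]
    b = [0.0 if k % 2 else 2.0 / (k + 1) for k in range(n)]
    return solve(A, b)

nodes = legendre_roots(3)
weights = gauss_weights(nodes)
print(nodes, weights)
```

Only the first n of the 2n moment equations are needed once the nodes are fixed; the remaining n are then satisfied automatically because the nodes are Legendre roots.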

Before continuing with an example of the use of Gaussian quadrature, it is of interest
to summarize the properties of Legendre polynomials.


1. The Legendre polynomials are orthogonal over the interval [-1, 1]. By this we
mean that

$$\int_{-1}^{1} L_n(x) L_m(x)\,dx
\begin{cases}
= 0 & \text{if } n \ne m, \\
> 0 & \text{if } n = m.
\end{cases}$$

This is a property of several other important functions, such as {cos(nx),
n = 0, 1, . . .}. Here we have

$$\int_0^{2\pi} \cos(nx)\cos(mx)\,dx
\begin{cases}
= 0 & \text{if } n \ne m, \\
> 0 & \text{if } n = m.
\end{cases}$$

In this case we say that these functions are orthogonal over the interval $[0, 2\pi]$.




2. Any polynomial of degree n can be written as a sum of the Legendre polynomials:

$$P_n(x) = \sum_{i=0}^{n} c_i L_i(x).$$

3. The n roots of $L_n(x) = 0$ lie in the interval [-1, 1].

Using these properties, we are able to show that Eq. (4.56) is exact for polynomials of
degree 2n - 1 or less.


The weighting factors and t-values for Gaussian quadrature have been tabulated. 
(Love (1966) gives values for up to 200-term formulas.) We are content to give a few 
of the values in Table 4.4. 

We illustrate the three-term formula with an example. 


EXAMPLE Evaluate $I = \int_{0.2}^{1.5} e^{-x^2}\,dx$ using the three-term Gaussian formula.

$$x = \frac{(1.5 - 0.2)t + 1.5 + 0.2}{2} = 0.65t + 0.85.$$

$$I = \frac{1.5 - 0.2}{2}\int_{-1}^{1} e^{-(0.65t + 0.85)^2}\,dt$$

$$\approx 0.65\left[0.555\ldots e^{-[0.65(-0.775\ldots) + 0.85]^2} + 0.888\ldots e^{-[0.65(0.0) + 0.85]^2}
+ 0.555\ldots e^{-[0.65(0.775\ldots) + 0.85]^2}\right]$$

$$= 0.6586 \quad\text{(compare to correct value 0.65882)}.$$
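The three-term calculation above can be reproduced with a small routine. This is a sketch, assuming the three-term nodes and weights tabulated in Table 4.4; the function name `gauss3` is hypothetical, and the reference value is computed from the error function in Python's `math` module.

```python
import math

# Nodes and weights of the three-term formula, as tabulated in Table 4.4.
NODES = (-0.77459667, 0.0, 0.77459667)
WEIGHTS = (0.55555555, 0.88888889, 0.55555555)

def gauss3(f, a, b):
    """Three-term Gaussian quadrature on [a, b] via a linear change of variable."""
    half = (b - a) / 2.0
    mid = (b + a) / 2.0
    return half * sum(w * f(half * t + mid) for t, w in zip(NODES, WEIGHTS))

estimate = gauss3(lambda x: math.exp(-x * x), 0.2, 1.5)
# Reference value: (sqrt(pi)/2) * (erf(1.5) - erf(0.2)).
exact = math.sqrt(math.pi) / 2.0 * (math.erf(1.5) - math.erf(0.2))
print(round(estimate, 4), round(exact, 5))
```

Only three evaluations of the integrand are needed to come within about 2 units in the fourth decimal place of the reference value.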


Table 4.4 Values for Gaussian quadrature

Number of                        Weighting       Valid up to
terms         Values of t        factor          degree

2             -0.57735027        1.0             3
               0.57735027        1.0
3             -0.77459667        0.55555555      5
               0.0               0.88888889
               0.77459667        0.55555555
4             -0.86113631        0.34785485      7
              -0.33998104        0.65214515
               0.33998104        0.65214515
               0.86113631        0.34785485
5             -0.90617975        0.23692689      9
              -0.53846931        0.47862867
               0.0               0.56888889
               0.53846931        0.47862867
               0.90617975        0.23692689


4.13


with bounded integrands. Sometimes we want to evaluate integrals such as 


$$I_1 = \int_0^{\infty} x e^{-x}\,dx \qquad (4.57)$$

lL= ‘3 5 (4.58) 
lo Vx 

i= [x= + 1d. (4.59) 


Equation (4.57) has an infinite upper limit; Eq. (4.58) has an infinite value for the integrand
at the lower limit, x = 0; Eq. (4.59) is an indefinite integral whose value is a function
of t.

Sometimes we can transform such integrals into the simpler forms that we have
discussed above, by a change of variable. For example, we can break $I_1$ into the sum of
two integrals:

$$I_1 = \int_0^1 x e^{-x}\,dx + \int_1^{\infty} x e^{-x}\,dx,$$

and then change the variable in the second term by substituting y = 1/x, so that

$$I_1 = \int_0^1 x e^{-x}\,dx + \int_0^1 \frac{e^{-1/y}}{y^3}\,dy.$$

In the transformed integral, the value of the integrand at y = 0 is indeterminate (0/0),
but this causes us no trouble because the limit of the integrand is zero as $y \to 0$.

Where the integral exists, as in Eq. (4.57), we can apply our numerical methods
directly, however. We just evaluate

$$\int_0^A x e^{-x}\,dx$$

for increasing values of A, and use the limiting value of these results as the value of the 
integral with infinite upper limit. Observe that this is just an adaptation of the usual 
calculus definition of the value of an integral with an infinite upper limit. One must be 
on guard in doing this when the convergence of the integral is unknown, for round-off 
error may give an upper limit to the sequence of results even though the integral actually 
diverges. 


EXAMPLE Table 4.5 gives results for the integral in Eq. (4.57). They show how the value of the
integral reaches a limiting value as the value of A increases. (A computer program that
implements Romberg integration was used.)

The value of the integral in Eq. (4.58) can be found similarly by finding the value
of $I_2$ with a variable lower limit that approaches closer and closer to zero, the point where
the integrand is undefined. Equation (4.59) is also amenable to this same approach: we
let t take on a series of values and produce a table that relates the value of $I_3$ to values
of t.
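The limiting process for the integral of Eq. (4.57) can be imitated with any of the earlier rules. The sketch below is an assumption-laden illustration: it uses a composite Simpson's 1/3 rule with a fixed panel width of 0.01 rather than the Romberg program mentioned in the text, so the last digits need not agree with the tabulated values.

```python
import math

def simpson(f, a, b, panels):
    """Composite Simpson's 1/3 rule; panels must be even."""
    h = (b - a) / panels
    total = f(a) + f(b)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def integrand(x):
    return x * math.exp(-x)

results = {}
for upper in (1, 10, 100, 1000):
    # Keep the panel width fixed at 0.01 so accuracy does not degrade as A grows.
    results[upper] = simpson(integrand, 0.0, float(upper), 100 * upper)
    print(upper, round(results[upper], 5))
```

The sequence of results settles toward 1, the analytical value of the integral with infinite upper limit.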


Table 4.5 Values for $I = \int_0^A x e^{-x}\,dx$, $A \to \infty$

A             I

1             0.26424
10            0.99950
100           1.00001
1000          1.00001
10000         1.00001
($\infty$)    (1.00000) (analytical result)


Equation (4.59) is a simple special case of a situation we will treat at length in the 
next chapter, which deals with differential equations. 


4.14 ADAPTIVE INTEGRATION


In Sections 4.8 and 4.10, we presented the trapezoidal rule and Simpson's rules for
finding the integral of f(x) over a fixed interval [a, b] using a uniform value for $\Delta x$.
When f(x) is a known function, one can choose the value for $\Delta x = h$ arbitrarily. The
problem is that we do not know what value to choose for h to attain a desired accuracy.

One way to select the value for h is that of Section 4.9. We can start with two panels,
$h = h_1 = (b - a)/2$, and apply one of the formulas. Then we let $h_2 = h_1/2$ and apply
the formula again, now with four panels, and compare the two results. If the new value
is sufficiently close, we terminate and apply a Romberg extrapolation to further reduce
the error. If the second result is not close enough to the first, we again halve h and repeat
the procedure. We continue in the same way until the last result is close enough to its
predecessor.

We illustrate this obvious procedure with an example:


EXAMPLE Integrate $f(x) = 1/x^2$ over the interval [0.2, 1]. Use a tolerance value of 0.02 to terminate
the halving of $h = \Delta x$. From calculus, we know that the exact answer is 4.0.

In order to compare this to an alternative approach, we introduce a notation that will
be helpful throughout this section.

Define: $I(f) = \int_a^b f(x)\,dx$, the exact value.
$S_n(f)$ = the computed value using Simpson's 1/3 rule with $\Delta x = h_n$.

If we use this notation, the extended Simpson's rule becomes

$$I(f) = S_n(f) - \frac{(b - a)}{180} h_n^4 f^{(4)}(\xi), \quad a < \xi < b.$$


Using this with $h_1 = (1.0 - 0.2)/2 = 0.4$, we compute $S_1(f)$. We continue halving
h, $h_{n+1} = h_n/2$, computing its corresponding $S_{n+1}(f)$ until $|S_{n+1}(f) - S_n(f)| < 0.02$,
the tolerance value. The following table shows the results.

n     h_n       S_n          |S_{n+1} - S_n|

1     0.4       4.948148
                             0.761111
2     0.2       4.187037
                             0.162819
3     0.1       4.024218
                             0.022054
4     0.05      4.002164
                             0.002010
5     0.025     4.000154

From the table we see that, at n = 5, we have met the tolerance criterion since $|S_5(f) - S_4(f)| < 0.02$.
A Romberg extrapolation gives

$$RS(f) = S_5(f) + \frac{S_5(f) - S_4(f)}{15} \approx 4.00002.$$

(We use the notation RS(f) to represent the Romberg extrapolation from Simpson's rule.)
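The halving procedure of this example is easy to mechanize. The following sketch, with illustrative names only, repeats the composite Simpson's 1/3 rule with twice as many panels each time until two successive results differ by less than the tolerance, then applies the Romberg extrapolation with the divisor 15.

```python
def simpson(f, a, b, panels):
    """Composite Simpson's 1/3 rule; panels must be even."""
    h = (b - a) / panels
    total = f(a) + f(b)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def f(x):
    return 1.0 / (x * x)

a, b, tol = 0.2, 1.0, 0.02
panels = 2
previous = simpson(f, a, b, panels)     # S_1 with h_1 = 0.4
history = [previous]
while True:
    panels *= 2                          # halve h
    current = simpson(f, a, b, panels)
    history.append(current)
    if abs(current - previous) < tol:
        break
    previous = current

# Romberg extrapolation from the last two Simpson results.
romberg = current + (current - previous) / 15.0
print(history, romberg)
```

For this integrand the loop stops after five Simpson values, the last computed with 32 panels (33 function values if nothing is reused).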


The disadvantage of this technique is that the value of h is the same over the entire
interval of integration, while the behavior of f(x) may not require this. Consider Fig.
4.6. It is obvious that, in the subinterval [c, b], h can be much larger than in subinterval
[a, c]. One could subdivide [a, b] by personal intervention after examining the curve of
f(x). We prefer to avoid such intervention, of course.


Figure 4.6


Adaptive integration automatically allows for different h's on different subintervals
of [a, b], choosing values adequate for a specified accuracy. In adaptive integration, we
do not specify how the total interval [a, b] is to be broken up; the point c could occur
anywhere within it. Since the place to subdivide [a, b] is not known a priori, we use
something like a binary search to find it. Actually, the total interval [a, b] may be broken
into many subintervals. This depends on the tolerance value and the behavior of f(x).

The strategy is to first divide [a, b] in half, compare the two results, and if the
tolerance criterion is exceeded, focus our attention only on the right half of [a, b], setting
aside the results for the left half for later use. We then divide the right half of [a, b] in
half and again compare results (now using a criterion half as great to compensate for the
reduced interval). We repeat until the terminating condition is met. Then we focus attention
on the portions of the interval of integration that were set aside, working from right to
left.

We repeat the above example to illustrate the procedure, again using a tolerance 
value of 0.02. 

Because we are working with different subintervals of [a, b] we modify the above
notation slightly:

$S_n(a_i, b_i)$ is the computed value from Simpson's rule for $h_n = (b - a)/2^n$, where
$a \le a_i < b_i \le b$.


The example begins just as before. Take $h_1 = (1.0 - 0.2)/2 = 0.4$. Then

$$S_1(0.2, 1.0) = 4.94814815,$$

$$S_2(0.2, 1.0) = S_2(0.2, 0.6) + S_2(0.6, 1.0) = 3.51851852 + 0.66851852 = 4.18703704, \qquad h_2 = 0.2.$$

Since $|S_2(0.2, 1.0) - S_1(0.2, 1.0)| = 0.7611111 > 0.02$, we must continue. We set
aside all the information we have obtained in evaluating $S_2(0.2, 0.6)$ for the moment.
Focusing on [0.6, 1.0], we compute

$$S_3(0.6, 1.0) = S_3(0.6, 0.8) + S_3(0.8, 1.0) = 0.41678477 + 0.25002572 = 0.66681049, \qquad h_3 = 0.1.$$

Since $|S_3(0.6, 1.0) - S_2(0.6, 1.0)| = 0.00170803 < 0.02/2 = 0.01$, we meet the
criterion. We use a Romberg extrapolation to get


$$RS(0.6, 1.0) = 0.66681049 + \frac{0.66681049 - 0.66851852}{15} = 0.66669662.$$

We are finished with the subinterval [0.6, 1.0] and turn our attention to the interval
[0.2, 0.6].


$$S_3(0.2, 0.6) = S_3(0.2, 0.4) + S_3(0.4, 0.6) = 2.52314815 + 0.83425926 = 3.35740741, \qquad h_3 = 0.1.$$

Now $|S_3(0.2, 0.6) - S_2(0.2, 0.6)| = 0.16111 > 0.02/2$. We must continue with $h_4 = 0.05$
and restrict ourselves to [0.4, 0.6]. We find that

$$S_4(0.4, 0.6) = S_4(0.4, 0.5) + S_4(0.5, 0.6) = 0.50005144 + 0.33334864 = 0.83340008, \qquad h_4 = 0.05.$$

Since $|S_4(0.4, 0.6) - S_3(0.4, 0.6)| = 0.00085918 < 0.02/4$, we use a Romberg extrapolation
to get

$$RS(0.4, 0.6) = 0.83334280.$$

We move on to the last set-aside interval, [0.2, 0.4]. We just summarize the results:

$$RS(0.3, 0.4) = 0.83334924 + \frac{0.83334924 - 0.83356954}{15} = 0.83333455, \text{ and}$$

$$RS(0.2, 0.3) = 1.66680016 + \frac{1.66680016 - 1.66851852}{15} = 1.66668560.$$

For the last two operations, we met a local tolerance of 0.02/8. Summing all the RS
values, we get

$$RS(0.2, 1.0) = RS(0.2, 0.3) + RS(0.3, 0.4) + RS(0.4, 0.6) + RS(0.6, 1.0) = 4.00005957.$$


The result should be compared to I(0.2, 1.0) = 4.0, the exact value. Using adaptive
integration instead of the first simplistic procedure, we reduced the number of function
evaluations from 33 to 17.

This diagram shows how the total interval was subdivided, with varying step sizes.

 0.2    0.3    0.4           0.6                         1.0
  |------|------|-------------|---------------------------|
     h = 0.025      h = 0.05            h = 0.10

Adaptive integration is efficient in reducing the total number of function evaluations.
However, one must store
the set-aside information for later use. Before we explain how this can be done in a
computer, let us review the information we had to set aside in this example. There were
three times when we had to set aside several values: $S_2(0.2, 0.6)$, $S_3(0.2, 0.4)$, and
$S_4(0.2, 0.3)$. We surely don't want to recompute function values, so there are seven
quantities to remember for each; we use a seven-column array for these:


Left      f(x) at    f(x) at    f(x) at    Step     Local
point     left       middle     right      size     tolerance    S(a, c)

0.2       25.0       6.25       2.7778     0.2      0.01         3.51851852
0.2       25.0       11.1111    6.25       0.1      0.005        2.52314815
0.2       25.0       16.0       11.1111    0.05     0.0025       1.66851852


The first row was stored from $S_2(0.2, 0.6)$; after computing RS(0.6, 1.0) it was
retrieved. The second row is from $S_3(0.2, 0.4)$; it was retrieved after computing RS(0.4,
0.6). The third row is from $S_4(0.2, 0.3)$ and was retrieved after computing RS(0.3, 0.4).

For a more complicated function we would have more rows but still seven columns.
Observe that we always retrieve the last row stored and store the next row at the bottom
of the array.

The following algorithm implements the adaptive integration presented. 


Algorithm for Computing $I(f) = \int_a^b f(x)\,dx$

Set: value = 0.0.
Evaluate: h_1 = (b - a)/2, c = a + h_1, Fa = f(a),
          Fc = f(c), Fb = f(b), Sab = S_1(a, b).
STORE(a, Fa, Fc, Fb, h_1, tol, Sab).
REPEAT
    RETRIEVE(a, Fa, Fc, Fb, h_1, tol, Sab)
    Evaluate: h_2 = h_1/2, c = a + h_1, d = a + h_2, e = a + 3h_2,
              Fd = f(d), Fe = f(e),
              Sac = S_2(a, c), Scb = S_2(c, b), S_2(a, b) = Sac + Scb.
    IF |S_2(a, b) - S_1(a, b)| < tol THEN
        Compute RS(a, b)
        value = value + RS(a, b)
    ELSE
        h_1 = h_2, tol = tol/2
        STORE(a, Fa, Fd, Fc, h_1, tol, Sac)
        STORE(c, Fc, Fe, Fb, h_1, tol, Scb)
UNTIL TABLE is EMPTY.
I(f) = value.

(Note that in the FORTRAN implementation of this algorithm we will use the well-known
structure from computer science known as a stack. Our table of seven columns
is just an n x 7 two-dimensional stack. To store a set of values, we do an operation
called PUSH; to retrieve a set, we do a POP.)
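The algorithm can be sketched in Python, with a list playing the role of the stack (`append` for PUSH, `pop` for POP). This is an illustrative translation, not the FORTRAN implementation referred to above; the function name and the evaluation counter are assumptions added for the demonstration.

```python
def adaptive_simpson(f, a, b, tol):
    """Adaptive Simpson integration using an explicit stack of set-aside rows."""
    evaluations = 3
    h = (b - a) / 2.0
    fa, fc, fb = f(a), f(a + h), f(b)
    s_ab = h / 3.0 * (fa + 4.0 * fc + fb)
    # Each row: left point, f(left), f(middle), f(right), step, tolerance, S.
    stack = [(a, fa, fc, fb, h, tol, s_ab)]
    value = 0.0
    while stack:
        a, fa, fc, fb, h1, tol, s_ab = stack.pop()        # RETRIEVE
        h2 = h1 / 2.0
        c = a + h1
        fd, fe = f(a + h2), f(a + 3.0 * h2)               # the only new values
        evaluations += 2
        s_ac = h2 / 3.0 * (fa + 4.0 * fd + fc)
        s_cb = h2 / 3.0 * (fc + 4.0 * fe + fb)
        s2 = s_ac + s_cb
        if abs(s2 - s_ab) < tol:
            value += s2 + (s2 - s_ab) / 15.0              # Romberg extrapolation
        else:
            # Store the left half first so the right half is retrieved first.
            stack.append((a, fa, fd, fc, h2, tol / 2.0, s_ac))
            stack.append((c, fc, fe, fb, h2, tol / 2.0, s_cb))
    return value, evaluations

value, count = adaptive_simpson(lambda x: 1.0 / (x * x), 0.2, 1.0, 0.02)
print(value, count)
```

Run on the example of this section, the sketch follows the same sequence of subintervals as the hand computation and uses 17 function evaluations.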


APPLICATIONS OF CUBIC SPLINE FUNCTIONS 


In addition to their obvious use for interpolation, splines (Chapter 3) can be used for
finding derivatives and integrals of functions, even when the function is known only as
a table of values. The smoothness of splines, because of the requirement that each portion
have the same first and second derivatives as its neighbor, can give improved accuracy
in some cases.

For the cubic spline that approximates f(x), we can write, for the interval $x_i \le x \le x_{i+1}$,

$$f(x) = a_i(x - x_i)^3 + b_i(x - x_i)^2 + c_i(x - x_i) + d_i,$$

where the coefficients are determined as in Section 3.9. The method outlined in that
section computes $S_i$ and $S_{i+1}$, the values of the second derivative at each end of the
subinterval. From these S-values and the values of f(x), we compute the coefficients of
the cubic:

$$a_i = \frac{S_{i+1} - S_i}{6(x_{i+1} - x_i)},$$

$$b_i = \frac{S_i}{2},$$

$$c_i = \frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i} - \frac{2(x_{i+1} - x_i)S_i + (x_{i+1} - x_i)S_{i+1}}{6},$$

$$d_i = f(x_i).$$

Approximating the first and second derivatives is straightforward; we estimate these
as the values of the derivatives of the cubic:

$$f'(x) = 3a_i(x - x_i)^2 + 2b_i(x - x_i) + c_i,$$

$$f''(x) = 6a_i(x - x_i) + 2b_i.$$

At the n + 1 points $x_i$ where the function is known and the spline matches f(x), these
formulas are particularly simple:

$$f'(x_i) = c_i,$$

$$f''(x_i) = 2b_i.$$


(We note that a cubic spline is not useful for approximating derivatives of order higher 
than the second. A higher degree of spline function would be required for these.) 


Approximating the integral of f(x) over the n intervals where f(x) is approximated
by the spline is similarly straightforward:

$$\int_{x_1}^{x_{n+1}} f(x)\,dx = \int_{x_1}^{x_{n+1}} P(x)\,dx$$

$$= \sum_{i=1}^{n}\left[\frac{a_i}{4}(x - x_i)^4 + \frac{b_i}{3}(x - x_i)^3 + \frac{c_i}{2}(x - x_i)^2 + d_i(x - x_i)\right]_{x_i}^{x_{i+1}}$$

$$= \sum_{i=1}^{n}\left[\frac{a_i}{4}(x_{i+1} - x_i)^4 + \frac{b_i}{3}(x_{i+1} - x_i)^3 + \frac{c_i}{2}(x_{i+1} - x_i)^2 + d_i(x_{i+1} - x_i)\right].$$

If the intervals are all the same size ($h = x_{i+1} - x_i$), this equation becomes

$$\int_{x_1}^{x_{n+1}} f(x)\,dx = \frac{h^4}{4}\sum_{i=1}^{n} a_i + \frac{h^3}{3}\sum_{i=1}^{n} b_i + \frac{h^2}{2}\sum_{i=1}^{n} c_i + h\sum_{i=1}^{n} d_i.$$

We illustrate the use of splines to compute derivatives and integrals by a simple
example.


EXAMPLE Compute the integral and derivatives of $f(x) = \sin \pi x$ over the interval $0 \le x \le 1$ from
the spline that fits at x = 0, 0.25, 0.5, 0.75, and 1.0. (See Table 4.6.) We use end
condition 1: $S_1 = 0$, $S_5 = 0$. Solving for the coefficients of the cubic spline, we get the
results shown in Table 4.7.

The derivatives can now be computed:

At x = 0.25:   f'(x) = 2.2164    (exact = 2.2214);     f''(x) = -7.344    (exact = -6.979);

At x = 0.50:   f'(x) = 0         (exact = 0);          f''(x) = -10.387   (exact = -9.870);

At x = 0.75:   f'(x) = -2.2164   (exact = -2.2214);    f''(x) = -7.344    (exact = -6.979).


The errors of the first derivatives computed from the cubic spline are smaller than those
from a fourth-degree interpolating polynomial; the errors of the second derivatives from
the spline are somewhat larger than from the polynomial.

Table 4.7

i     x_i      S_i          a_i         b_i         c_i         d_i

1     0        0            -4.8960     0           3.1340      0
2     0.25     -7.3440      -2.0288     -3.6720     2.2164      0.7071
3     0.50     -10.3872     2.0288      -5.1936     0           1.0
4     0.75     -7.3440      4.8960      -3.6720     -2.2164     0.7071

$$\int_0^1 f(x)\,dx = \frac{(0.25)^4}{4}(0) + \frac{(0.25)^3}{3}(-12.5376) + \frac{(0.25)^2}{2}(3.1340) + 0.25(2.4142)$$

$$= 0.6362 \quad\text{(exact = 0.6366; error = +0.0004)}.$$

For this example, the accuracy is significantly better than that obtained with Simpson's
1/3 rule, which gives 0.6381, error = -0.0015.
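The spline computation in this example can be sketched as follows. The code assumes the end condition with zero second derivative at both ends and solves the resulting tridiagonal system for the interior S-values; it is an illustration with hypothetical function names, not the procedure of Section 3.9 verbatim, and it carries more digits than the hand calculation.

```python
import math

def natural_spline_second_derivs(x, y):
    """Second derivatives S_i of the cubic spline with S = 0 at both ends."""
    n = len(x) - 1                               # number of intervals
    h = [x[i + 1] - x[i] for i in range(n)]
    # Tridiagonal system for the interior S-values (rows i = 1, ..., n - 1).
    sub = [h[i - 1] for i in range(1, n)]
    diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
    sup = [h[i] for i in range(1, n)]
    rhs = [6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
           for i in range(1, n)]
    # Forward elimination, then back substitution.
    for k in range(1, n - 1):
        m = sub[k] / diag[k - 1]
        diag[k] -= m * sup[k - 1]
        rhs[k] -= m * rhs[k - 1]
    inner = [0.0] * (n - 1)
    inner[-1] = rhs[-1] / diag[-1]
    for k in range(n - 3, -1, -1):
        inner[k] = (rhs[k] - sup[k] * inner[k + 1]) / diag[k]
    return [0.0] + inner + [0.0]

def spline_coeffs(x, y, S):
    """Coefficients (a, b, c, d) and width h of the cubic on each interval."""
    coeffs = []
    for i in range(len(x) - 1):
        h = x[i + 1] - x[i]
        a = (S[i + 1] - S[i]) / (6.0 * h)
        b = S[i] / 2.0
        c = (y[i + 1] - y[i]) / h - (2.0 * h * S[i] + h * S[i + 1]) / 6.0
        coeffs.append((a, b, c, y[i], h))
    return coeffs

def spline_integral(coeffs):
    """Integrate each cubic piece exactly over its own interval and sum."""
    return sum(a * h**4 / 4 + b * h**3 / 3 + c * h**2 / 2 + d * h
               for a, b, c, d, h in coeffs)

xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [math.sin(math.pi * v) for v in xs]
S = natural_spline_second_derivs(xs, ys)
coeffs = spline_coeffs(xs, ys, S)
integral = spline_integral(coeffs)
print(S, integral)
```

The first derivative at a knot is the c-coefficient of the piece that starts there, and the second derivative is the corresponding S-value, so both can be read directly from `coeffs` and `S`.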


MULTIPLE INTEGRALS 


We consider first the case when the limits of integration are constants. In the calculus
we learned that a double integral may be evaluated as an iterated integral; in other words,
we may write

$$\iint_A f(x, y)\,dA = \int_a^b\left(\int_c^d f(x, y)\,dy\right)dx = \int_c^d\left(\int_a^b f(x, y)\,dx\right)dy. \qquad (4.60)$$

In Eq. (4.60) the rectangular region A is bounded by the lines

x = a,   x = b,   y = c,   y = d.

In computing the iterated integrals, we hold x constant while integrating with respect to
y (vice versa in the second case).

Adapting the numerical integration formulas developed earlier in this chapter for
integration with respect to just one independent variable is quite straightforward. Recall
that any of the integration formulas are just a linear combination of values of the function,
evaluated at varying values of the independent variable. In other words, a quadrature
formula is just a weighted sum of certain functional values. The inner integral is written
then as a weighted sum of function values with one variable held constant. We then add
together a weighted sum of these sums. If the function is known only at the nodes of a
rectangular grid through the region, we are constrained to use these values. The
Newton-Cotes formulas are a convenient set to employ. There is no reason why the same formula
must be used in each direction, although it is often particularly convenient to do so.


EXAMPLE We illustrate this technique by evaluating the integral of the function of Table 4.8 over
the rectangular region bounded by

x = 1.5,   x = 3.0,   y = 0.2,   y = 0.6.

Let us use the trapezoidal rule in the x-direction and Simpson's 1/3 rule in the y-direction.
(Since the number of panels in the x-direction is not even, Simpson's 1/3 rule does not
apply readily.) It is immaterial which integral we evaluate first. Suppose we start with y
constant:

$$y = 0.2: \quad \int_{1.5}^{3.0} f(x, y)\,dx = \int_{1.5}^{3.0} f(x, 0.2)\,dx = \frac{h}{2}(f_1 + 2f_2 + 2f_3 + f_4)$$

$$= \frac{0.5}{2}[0.990 + 2(1.568) + 2(2.520) + 4.090] = 3.3140;$$

$$y = 0.3: \quad \int_{1.5}^{3.0} f(x, 0.3)\,dx = \frac{0.5}{2}[1.524 + 2(2.384) + 2(3.800) + 6.136] = 5.0070.$$

Similarly, at

y = 0.4,   I = 6.6522;
y = 0.5,   I = 8.2368;
y = 0.6,   I = 9.7435.

We now sum these in the y-direction according to Simpson's rule:

$$\int_{0.2}^{0.6}\int_{1.5}^{3.0} f(x, y)\,dx\,dy = \frac{0.1}{3}[3.3140 + 4(5.0070) + 2(6.6522) + 4(8.2368) + 9.7435] = 2.6446.$$


Table 4.8 Tabulation of a function of two variables, u = f(x, y)

 x \ y     0.1       0.2       0.3       0.4       0.5       0.6

 0.5       0.165     0.428     0.687     0.942     1.190     1.431
 1.0       0.271     0.640     1.003     1.359     1.703     2.035
 1.5       0.447     0.990     1.524     2.045     2.549     3.031
 2.0       0.738     1.568     2.384     3.177     3.943     4.672
 2.5       1.216     2.520     3.800     5.044     6.241     7.379
 3.0       2.005     4.090     6.136     8.122     10.030    11.841
 3.5       3.306     6.679     9.986     13.196    16.277    19.198

(In this example our answer does not check well with the analytical value of 2.5944
because the x-intervals are large. We could improve our estimate somewhat by fitting a
higher-degree polynomial than the first to provide the integration formula. We can even
use values outside the range of integration for this, using the techniques of Section 4.11
to derive the formulas.)
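The two-stage summation of the example can be written compactly. This is a sketch using the portion of Table 4.8 inside the region of integration; the helper names are hypothetical. One entry, 3.177 at x = 2.0, y = 0.4, is an assumption: it is illegible in the source used here and was inferred from the quoted row integral 6.6522.

```python
# Rows: x = 1.5, 2.0, 2.5, 3.0; columns: y = 0.2, 0.3, 0.4, 0.5, 0.6.
F = [
    [0.990, 1.524, 2.045, 2.549, 3.031],
    [1.568, 2.384, 3.177, 3.943, 4.672],   # 3.177 is inferred (see lead-in)
    [2.520, 3.800, 5.044, 6.241, 7.379],
    [4.090, 6.136, 8.122, 10.030, 11.841],
]
dx, dy = 0.5, 0.1

def trapezoid(values, h):
    """Composite trapezoidal rule on equally spaced values."""
    return h / 2.0 * (values[0] + 2.0 * sum(values[1:-1]) + values[-1])

def simpson(values, h):
    """Composite Simpson's 1/3 rule; needs an odd number of values."""
    return h / 3.0 * (values[0] + values[-1]
                      + 4.0 * sum(values[1:-1:2]) + 2.0 * sum(values[2:-1:2]))

columns = list(zip(*F))                          # one tuple of x-values per y
inner = [trapezoid(col, dx) for col in columns]  # integrals in x at constant y
double_integral = simpson(inner, dy)             # then Simpson in y
print([round(v, 4) for v in inner], round(double_integral, 4))
```

Swapping the order (Simpson in y first, then the trapezoidal rule in x) gives the same number, since both stages are linear combinations of the same table entries.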

The previous example shows that double integration by numerical means reduces to
a double summation of weighted function values. The calculations we have just made
could be written in the form

$$\iint f(x, y)\,dx\,dy = \frac{\Delta y}{3}\,\frac{\Delta x}{2}\Big[(f_{1,1} + 2f_{2,1} + 2f_{3,1} + f_{4,1})
+ 4(f_{1,2} + 2f_{2,2} + 2f_{3,2} + f_{4,2})
+ \cdots + (f_{1,5} + 2f_{2,5} + 2f_{3,5} + f_{4,5})\Big].$$

It is convenient to write this in a pictorial operator form, in which the weighting factors
are displayed in an array that is a map to the location of the functional values to which
they are applied.

$$\iint f(x, y)\,dx\,dy = \frac{\Delta y}{3}\,\frac{\Delta x}{2}
\begin{Bmatrix}
1 & 4 & 2 & 4 & 1 \\
2 & 8 & 4 & 8 & 2 \\
2 & 8 & 4 & 8 & 2 \\
1 & 4 & 2 & 4 & 1
\end{Bmatrix} f_{i,j}. \qquad (4.61)$$

We interpret the numbers in the array of Eq. (4.61) in this manner: We use the values
1, 4, 2, 4, and 1 as weighting factors for functional values in the top row of the portion
of Table 4.8 that we integrate over (values where x = 1.5 and y varies from 0.2 to 0.6).
Similarly, the second column of the array in Eq. (4.61) represents weighting factors that
are applied to a column of function values where y = 0.3 and x varies from 1.5 to 3.0.
Observe that the values in the pictorial operator of Eq. (4.61) follow immediately from
the Newton-Cotes coefficients for single-variable integration.

Other combinations of Newton—Cotes formulas give similar results. It is probably 
easiest for hand calculation to use these pictorial integration operators. Pictorial integration 
is readily adapted to any desired combination of integration formulas. Except for the 
difficulty of representation beyond two dimensions, this operator technique also applies 
to triple and quadruple integrals. 
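A pictorial operator of this kind is simply the outer product of the two one-variable weight lists. The sketch below builds it for the trapezoidal rule in x and Simpson's 1/3 rule in y and applies it to the relevant part of Table 4.8. The entry 3.177 is an assumption inferred from the quoted row integral 6.6522, because that table entry is illegible in the source used here.

```python
trap_x = [1, 2, 2, 1]           # trapezoidal weights, x-direction (4 points)
simp_y = [1, 4, 2, 4, 1]        # Simpson 1/3 weights, y-direction (5 points)

# Outer product: operator[i][j] multiplies f(x_i, y_j).
operator = [[wx * wy for wy in simp_y] for wx in trap_x]

# Rows: x = 1.5, 2.0, 2.5, 3.0; columns: y = 0.2, ..., 0.6 (from Table 4.8).
F = [
    [0.990, 1.524, 2.045, 2.549, 3.031],
    [1.568, 2.384, 3.177, 3.943, 4.672],   # 3.177 is inferred (see lead-in)
    [2.520, 3.800, 5.044, 6.241, 7.379],
    [4.090, 6.136, 8.122, 10.030, 11.841],
]
dx, dy = 0.5, 0.1

total = sum(operator[i][j] * F[i][j] for i in range(4) for j in range(5))
estimate = (dy / 3.0) * (dx / 2.0) * total
for row in operator:
    print(row)
print(round(estimate, 4))
```

Changing either weight list (for example, to Simpson's 3/8 rule weights) changes the operator without touching the rest of the computation.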

There is an alternative representation to such pictorial operators that is easier to
translate into a computer program. We also derive it somewhat differently. Consider the
numerical integration formula for one variable


$$\int_{-1}^{1} f(x)\,dx = \sum_{i=1}^{n} a_i f(x_i). \qquad (4.62)$$

We have seen in Section 4.12 that such formulas can be made exact if f(x) is any
polynomial of a certain degree. Assume that Eq. (4.62) holds for polynomials up to
degree s.*

*For Newton-Cotes formulas, s = n - 1 for n even and s = n for n odd. For Gaussian quadrature formulas,
s = 2n - 1, and the $x_i$ will be unevenly spaced.

We now consider the multiple integral formula

$$\int_{-1}^{1}\int_{-1}^{1}\int_{-1}^{1} f(x, y, z)\,dx\,dy\,dz = \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n} a_i a_j a_k f(x_i, y_j, z_k). \qquad (4.63)$$


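We wish to show that Eq. (4.63) is exact for all polynomials in x, y, and z up to
degree s. Such a polynomial is a linear combination of terms of the form $x^{\alpha}y^{\beta}z^{\gamma}$, where
$\alpha$, $\beta$, and $\gamma$ are nonnegative integers whose sum is equal to s or less. If we can prove
that Eq. (4.63) holds for the general term of this form, it will then hold for the polynomial.

To do this we assume that

$$f(x, y, z) = x^{\alpha}y^{\beta}z^{\gamma}.$$

Then, since the limits are constants and the integrand is factorable,

$$I = \int_{-1}^{1}\int_{-1}^{1}\int_{-1}^{1} x^{\alpha}y^{\beta}z^{\gamma}\,dx\,dy\,dz
= \left(\int_{-1}^{1} x^{\alpha}\,dx\right)\left(\int_{-1}^{1} y^{\beta}\,dy\right)\left(\int_{-1}^{1} z^{\gamma}\,dz\right).$$

Replacing each term according to Eq. (4.62), we get

$$I = \left(\sum_{i=1}^{n} a_i x_i^{\alpha}\right)\left(\sum_{j=1}^{n} a_j y_j^{\beta}\right)\left(\sum_{k=1}^{n} a_k z_k^{\gamma}\right)
= \sum_{i=1}^{n} a_i x_i^{\alpha}\sum_{j=1}^{n} a_j y_j^{\beta}\sum_{k=1}^{n} a_k z_k^{\gamma}. \qquad (4.64)$$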


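We need now an elementary rule about the product of summations. We illustrate it for a
simple case. We assert that

$$\left(\sum_{i=1}^{3} u_i\right)\left(\sum_{j=1}^{2} v_j\right) = \sum_{i=1}^{3}\left(\sum_{j=1}^{2} u_i v_j\right) = \sum_{i=1}^{3}\sum_{j=1}^{2} u_i v_j.$$

The last equality is purely notational. We prove the first by expanding both sides:

$$\left(\sum_{i=1}^{3} u_i\right)\left(\sum_{j=1}^{2} v_j\right) = (u_1 + u_2 + u_3)(v_1 + v_2)
= u_1v_1 + u_1v_2 + u_2v_1 + u_2v_2 + u_3v_1 + u_3v_2;$$

$$\sum_{i=1}^{3}\sum_{j=1}^{2} u_i v_j = (u_1v_1 + u_1v_2) + (u_2v_1 + u_2v_2) + (u_3v_1 + u_3v_2).$$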

On removing parentheses we see the two sides are the same. Using this principle, we
can write Eq. (4.64) in the form

$$I = \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n} a_i a_j a_k x_i^{\alpha}y_j^{\beta}z_k^{\gamma}, \qquad (4.65)$$

which shows that the questioned equality of Eq. (4.63) is valid, and we can write a
program for a triple integral by three nested DO loops. The coefficients $a_i$ are chosen
from any numerical integration formula. If the three one-variable formulas corresponding
to Eq. (4.63) are not identical, an obvious modification of Eq. (4.65) applies. In some
cases a change of variable is needed to correspond to Eq. (4.62).

If we are evaluating a multiple integral numerically where the integrand is a known 
function, our choice of the form of Eq. (4.62) is wider. Of higher efficiency than the 
Newton—Cotes formulas is Gaussian quadrature. Since it also fits the pattern of Eq. (4.62) 
the formula of Eq. (4.65) applies. We illustrate this with a simple example. 


EXAMPLE Evaluate

$$I = \int_0^1\int_{-1}^{0}\int_{-1}^{1} y z e^x\,dx\,dy\,dz$$

by Gaussian quadrature using a three-term formula for x and two-term formulas for y and
z. We first make the changes of variables to adjust the limits for y and z to (-1, 1):

$$y = \frac{u - 1}{2}, \qquad z = \frac{v + 1}{2}.$$

Our integral becomes

$$I = \frac{1}{16}\int_{-1}^{1}\int_{-1}^{1}\int_{-1}^{1} (u - 1)(v + 1)e^x\,dx\,du\,dv.$$

The two- and three-point Gaussian formulas are, from Section 4.12:

$$\int_{-1}^{1} f(x)\,dx = (1)f(-0.5774) + (1)f(0.5774),$$

$$\int_{-1}^{1} f(x)\,dx = \left(\tfrac{5}{9}\right)f(-0.7746) + \left(\tfrac{8}{9}\right)f(0) + \left(\tfrac{5}{9}\right)f(0.7746).$$

The integral is then

$$I = \frac{1}{16}\sum_{i=1}^{2}\sum_{j=1}^{2}\sum_{k=1}^{3} a_i a_j b_k (u_i - 1)(v_j + 1)e^{x_k},$$

for

$$a_1 = 1, \quad a_2 = 1,$$

$$b_1 = \tfrac{5}{9}, \quad b_2 = \tfrac{8}{9}, \quad b_3 = \tfrac{5}{9},$$

and values of u, v, and x as above.
A few representative terms of the sum are

$$I = \frac{1}{16}\Big[(1)(1)\left(\tfrac{5}{9}\right)(-0.5774 - 1)(-0.5774 + 1)e^{-0.7746}
+ (1)(1)\left(\tfrac{8}{9}\right)(-0.5774 - 1)(-0.5774 + 1)e^{0}$$

$$+ (1)(1)\left(\tfrac{5}{9}\right)(-0.5774 - 1)(-0.5774 + 1)e^{0.7746}
+ (1)(1)\left(\tfrac{5}{9}\right)(-0.5774 - 1)(0.5774 + 1)e^{-0.7746} + \cdots\Big].$$

On evaluating, we get I = -0.58758. The analytical value is

$$-\tfrac{1}{4}(e - e^{-1}) = -0.58760.$$
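The triple sum of this example maps directly onto three nested loops. The following is a minimal sketch, assuming the changes of variable y = (u - 1)/2 and z = (v + 1)/2 and the four-digit nodes quoted in the text; the variable names are illustrative.

```python
import math

# (node, weight) pairs for the two- and three-point Gaussian formulas.
two_point = [(-0.5774, 1.0), (0.5774, 1.0)]
three_point = [(-0.7746, 5.0 / 9.0), (0.0, 8.0 / 9.0), (0.7746, 5.0 / 9.0)]

total = 0.0
for u, a_i in two_point:              # y = (u - 1)/2 runs over (-1, 0)
    for v, a_j in two_point:          # z = (v + 1)/2 runs over (0, 1)
        for x, b_k in three_point:    # x already runs over (-1, 1)
            total += a_i * a_j * b_k * (u - 1.0) * (v + 1.0) * math.exp(x)

estimate = total / 16.0               # Jacobian and scaling factor 1/16
exact = -(math.e - 1.0 / math.e) / 4.0
print(round(estimate, 5), round(exact, 5))
```

Only 12 evaluations of the integrand are needed; because the integrand is linear in y and in z, the remaining error comes entirely from the three-point rule applied to the exponential.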


4.17 ERRORS IN MULTIPLE INTEGRATION AND EXTRAPOLATIONS


The error term of a one-variable quadrature formula is an additive one just like the other
terms in the linear combination (although of special form). It would seem reasonable that
it would go through the multiple summations in a similar fashion, so we should expect
error terms for multiple integration that are analogous to the one-dimensional case. We
illustrate that this is true for double integration using the trapezoidal rule in both directions,
with uniform spacings, choosing n intervals in the x-direction and m in the y-direction.
From Section 4.8 we have

$$\text{Error of } \int_a^b f(x)\,dx = -\frac{b - a}{12}h^2 f''(\xi) = O(h^2), \qquad h = \Delta x = \frac{b - a}{n}.$$

In developing Romberg integration, we found that the error term could be written as

$$\text{Error} = O(h^2) = Ah^2 + O(h^4) = Ah^2 + Bh^4,$$

where A is a constant and the value of B depends on a fourth derivative of the function.
Appending this error term to the trapezoidal rule, we get (equivalent to Eq. (4.62))

$$\int_a^b f(x, y)\,dx\,\Big|_{y = y_j} = \frac{h}{2}\left(f_{0,j} + 2f_{1,j} + 2f_{2,j} + \cdots + f_{n,j}\right) + A_j h^2 + B_j h^4.$$


Summing these in the y-direction and retaining only the error terms, we have

$$\int_c^d\int_a^b f(x, y)\,dx\,dy = \sum_{j=0}^{m}\sum_{i=0}^{n} a_i a_j f_{i,j}
+ \frac{k}{2}(A_0 + 2A_1 + 2A_2 + \cdots + A_m)h^2$$

$$+ \frac{k}{2}(B_0 + 2B_1 + 2B_2 + \cdots + B_m)h^4 + \bar{A}k^2 + \bar{B}k^4, \qquad (4.66)$$

$$k = \Delta y = \frac{d - c}{m}.$$

In Eq. (4.66), $\bar{A}$ and $\bar{B}$ are the coefficients of the error term for y. The coefficients A and
B for the error terms in the x-direction may be different for each of the (m + 1) y-values,
but each of the sums in parentheses in Eq. (4.66) is 2m times some average value of A
or B, so the error terms become

$$\text{Error} = \frac{k}{2}(2m)A_{av}h^2 + \frac{k}{2}(2m)B_{av}h^4 + \bar{A}k^2 + \bar{B}k^4. \qquad (4.67)$$

Since both $\Delta x$ and $\Delta y$ are constant, we may take $\Delta y = k = \alpha\,\Delta x = \alpha h$, where
$\alpha = \Delta y/\Delta x$, and Eq. (4.67) can be written, with mk = (d - c),

$$\text{Error} = (d - c)A_{av}h^2 + (d - c)B_{av}h^4 + \bar{A}\alpha^2h^2 + \bar{B}\alpha^4h^4
= K_1h^2 + K_2h^4.$$

Here $K_2$ will depend on fourth-order partial derivatives. This confirms our expectation
that the error term of double integration by numerical means is of the same form as for
single integration.

Since this is true, a Romberg integration may be applied to multiple integration,
whereby we extrapolate to an $O(h^4)$ estimate from two trapezoidal computations at a
2-to-1 interval ratio. From two such $O(h^4)$ computations we may extrapolate to one of $O(h^6)$
error.
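The extrapolation described above can be tried on a double integral with a known value. The sketch below is an illustration under assumptions: the test integrand exp(x + y) on the unit square and the panel counts 4 and 8 are choices made here, not taken from the text.

```python
import math

def trapezoid_2d(f, a, b, c, d, n, m):
    """Trapezoidal rule in both directions: n panels in x, m panels in y."""
    h, k = (b - a) / n, (d - c) / m
    total = 0.0
    for i in range(n + 1):
        wx = 1.0 if i in (0, n) else 2.0
        for j in range(m + 1):
            wy = 1.0 if j in (0, m) else 2.0
            total += wx * wy * f(a + i * h, c + j * k)
    return total * h * k / 4.0

def f(x, y):
    return math.exp(x + y)

exact = (math.e - 1.0) ** 2
coarse = trapezoid_2d(f, 0.0, 1.0, 0.0, 1.0, 4, 4)    # h = k = 0.25
fine = trapezoid_2d(f, 0.0, 1.0, 0.0, 1.0, 8, 8)      # h = k = 0.125
# The leading error term is proportional to h**2, so halving h divides it by 4.
extrapolated = fine + (fine - coarse) / 3.0

print(abs(coarse - exact), abs(fine - exact), abs(extrapolated - exact))
```

The ratio of step sizes in x and y is kept fixed between the two computations, which is what allows a single extrapolation to remove the combined $h^2$ term.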


MULTIPLE INTEGRATION WITH VARIABLE LIMITS 


If the limits of integration are not constant, so that the region in the x,y-plane upon which 
the integrand f(x, y) is to be summed is not rectangular, we must modify the above 
procedure. Let us consider a simple example to illustrate the method. Evaluate 


| [re y) dy dx 


over the region bounded by the lines x = 0, x = 1, y = 0, and the curve y = x? + 1. 
The region is sketched in Fig. 4.7. If we draw vertical lines spaced at Ax = 0.2 apart, 
shown as dashed lines in Fig. 4.7, it is obvious that we can approximate the inner integral 
at constant x-values along any one of the vertical lines (including x = 0 and x = 1). If 
we use the trapezoidal rule with five panels for each of these, we get the series of sums 


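$$S_1 = \frac{h_1}{2}\left(f_a + 2f_b + 2f_c + 2f_d + 2f_e + f_f\right),$$

$$S_2 = \frac{h_2}{2}\left(f_g + 2f_h + 2f_i + 2f_j + 2f_k + f_l\right),$$

$$S_3 = \frac{h_3}{2}\left(f_m + 2f_n + \cdots\right),$$

$$\vdots$$

$$S_6 = \frac{h_6}{2}\left(\cdots\right).$$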


Figure 4.7 [Region of integration bounded by x = 0, x = 1, y = 0, and $y = x^2 + 1$, with dashed vertical lines spaced $\Delta x = 0.2$ apart.]


The subscripts here indicate the values of the function at the points so labeled in
Fig. 4.7. The values of the $h_i$ are not the same in the above equations, but in each they
are the vertical distances divided by 5. The combination of these sums to give an estimate
of the double integral will then be

$$\text{Integral} = \frac{0.2}{2}\left(S_1 + 2S_2 + 2S_3 + 2S_4 + 2S_5 + S_6\right).$$


To be even more specific, suppose f(x, y) = xy. Then 


$$S_1 = \frac{1.0/5}{2}(0 + 0 + 0 + 0 + 0 + 0) = 0,$$

$$S_2 = \frac{1.04/5}{2}(0 + 0.0832 + 0.1664 + 0.2496 + 0.3328 + 0.208) = 0.1082,$$

$$S_3 = \frac{1.16/5}{2}(0 + 0.1856 + 0.3712 + 0.5568 + 0.7424 + 0.464) = 0.2691,$$

$$S_4 = \frac{1.36/5}{2}(0 + 0.3264 + 0.6528 + 0.9792 + 1.3056 + 0.816) = 0.5549,$$

$$S_5 = \frac{1.64/5}{2}(0 + 0.5248 + 1.0496 + 1.5744 + 2.0992 + 1.312) = 1.0758,$$

$$S_6 = \frac{2.0/5}{2}(0 + 0.8 + 1.6 + 2.4 + 3.2 + 2.0) = 2.0;$$

$$\text{Integral} = \frac{0.2}{2}(0 + 0.2164 + 0.5382 + 1.1098 + 2.1516 + 2.0) = 0.6016 \text{ versus analytical value of } 0.583333.$$
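The worked example can be checked with a short Python sketch. The helper names `trapezoid` and `double_trapezoid_variable` are hypothetical, and the code assumes (as in the example) that the lower limit in y is zero and that the upper limit is supplied as a function of x.

```python
def trapezoid(values, h):
    """Trapezoidal rule for equally spaced samples."""
    return h / 2.0 * (values[0] + 2.0 * sum(values[1:-1]) + values[-1])

def double_trapezoid_variable(f, a, b, upper, nx, ny):
    """Integrate f(x, y) for a <= x <= b and 0 <= y <= upper(x).

    Each vertical line gets its own y-spacing, upper(x) / ny, as in the text.
    Returns the estimate and the list of vertical sums S_1, S_2, ...
    """
    hx = (b - a) / nx
    column_sums = []
    for i in range(nx + 1):
        x = a + i * hx
        hy = upper(x) / ny
        column = [f(x, j * hy) for j in range(ny + 1)]
        column_sums.append(trapezoid(column, hy))
    return trapezoid(column_sums, hx), column_sums

estimate, sums = double_trapezoid_variable(
    lambda x, y: x * y,       # integrand f(x, y) = xy
    0.0, 1.0,                 # x runs from 0 to 1
    lambda x: x * x + 1.0,    # upper curve y = x^2 + 1
    5, 5,                     # five panels in each direction
)
print(sums)       # the six vertical sums
print(estimate)   # about 0.6016; the analytical value is 7/12
```

Because f = xy is linear in y, each vertical sum is exact here; the whole error of about 0.018 comes from the trapezoidal rule in the x-direction.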


The extension of this to more complicated regions and the adaptation to the use of 
Simpson’s rule should be obvious. If the functions that define the region are not single- 
valued, one must divide the region into subregions to avoid the problem, but this must 
also be done to integrate analytically. 

The previous calculations were not very accurate because the trapezoidal rule has 
relatively large errors. Gaussian quadrature should be an improvement, even using fewer 
points within the region. Let us use 3-point quadrature in the x-direction and 4-point 
quadrature in the y-direction. As in Section 4.12, we must change the limits of integration: 


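$$\int_0^1\!\int_0^{x^2+1} f(x, y)\,dy\,dx$$

to

$$\frac{1}{4}\int_{-1}^{1}\!\int_{-1}^{1}\left[\left(\frac{s+1}{2}\right)^2 + 1\right] f\!\left(\frac{s+1}{2},\ \frac{1}{2}\left[\left(\frac{s+1}{2}\right)^2 + 1\right](t + 1)\right)dt\,ds,$$

in which we made the following substitutions:

$$x = \frac{s+1}{2}, \quad dx = \frac{ds}{2}; \qquad y = \frac{x^2+1}{2}\,(t+1), \quad dy = \frac{x^2+1}{2}\,dt.$$

The integral is approximated by the sum

$$\frac{1}{4}\sum_{i=1}^{3}\sum_{j=1}^{4} w_i W_j f(s_i, t_j),$$

where $f(s_i, t_j)$ stands for the transformed integrand (including the factor in brackets)
evaluated at $(s_i, t_j)$, and the $w_i$'s, $W_j$'s, $s_i$'s, and $t_j$'s are the values taken from Table 4.4.
Using that table, we set $w_1 = 0.55555555$, $w_3 = w_1$, and $w_2 = 0.88888889$; we set
$s_1 = -0.77459667$, $s_3 = -s_1$, and $s_2 = 0.0$. The values for the $W_j$'s and $t_j$'s are obtained in
the same way. For each fixed i, i = 1, 2, 3, let $S_i$ be the corresponding value obtained using
Gaussian quadrature for a fixed $s_i$, where $S_i = \sum_{j=1}^{4} W_j f(s_i, t_j)$.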

The following intermediate values are easily verified: 


$$S_1 = (0.00279158 + 0.02487506 + 0.05050174 + 0.03741447) = 0.11558285,$$

$$S_2 = (0.01886891 + 0.16813600 + 0.34135240 + 0.25289269) = 0.78125000,$$

$$S_3 = (0.06845742 + 0.61000649 + 1.23844492 + 0.91750833) = 2.83441716.$$

We sum these as follows:

$$\text{Integral} = \frac{1}{4}\left(w_1 S_1 + w_2 S_2 + w_3 S_3\right) = 0.58333334,$$

which agrees with the exact answer to seven places. In this case we used only 12
evaluations of the function (exceptionally simple to do here, but usually more costly)
compared to the 36 used with the trapezoidal rule.

To keep track of the intermediate computations, it is convenient to use a template
[template diagram not recoverable from the source] in which the $S_i$'s are computed
along the verticals.
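The Gaussian computation above can be reproduced with a brief Python sketch. The node and weight values are the standard 3-point and 4-point Gauss-Legendre constants typed to ten digits; the function name `gauss_double` and the assumption that the lower limit in y is zero are illustrative choices, not part of the book's programs.

```python
# (node, weight) pairs on [-1, 1]
GAUSS3 = [(-0.7745966692, 0.5555555556),
          (0.0, 0.8888888889),
          (0.7745966692, 0.5555555556)]
GAUSS4 = [(-0.8611363116, 0.3478548451),
          (-0.3399810436, 0.6521451549),
          (0.3399810436, 0.6521451549),
          (0.8611363116, 0.3478548451)]

def gauss_double(f, upper):
    """Integrate f(x, y) for 0 <= x <= 1 and 0 <= y <= upper(x).

    Three points are used in x and four in y, after mapping both ranges
    to [-1, 1] with the substitutions given in the text.
    """
    total = 0.0
    for s, w in GAUSS3:
        x = (s + 1.0) / 2.0              # x = (s + 1)/2, dx = ds/2
        half = upper(x) / 2.0            # dy = (upper/2) dt
        inner = sum(W * f(x, half * (t + 1.0)) for t, W in GAUSS4)
        total += w * half * inner
    return total / 2.0                   # the factor from dx = ds/2

result = gauss_double(lambda x, y: x * y, lambda x: x * x + 1.0)
print(result)   # close to 7/12 = 0.583333...
```

Only 12 function evaluations are made, matching the count quoted in the text.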


CHAPTER SUMMARY 


If you are not able to do all of the following, we advise you to restudy portions of this
chapter. You should now be able to:

1. Develop formulas for differentiation and integration from an interpolating
polynomial.

2. Derive an expression for the error terms for the formulas developed in the preceding
question.

3. Use a lozenge diagram to write several formulas for differentiation and distinguish
whether they are based on a forward, backward, or other type of interpolating
polynomial.

4. Derive formulas by the symbolic method and from Taylor series.

5. Apply formulas to obtain derivatives and integrals and estimate the error in the
result.

6. Use extrapolation techniques to get improved estimates for the values of the
derivative and integral. You should be able to explain why these methods work.

7. Explain how round-off errors influence the accuracy of the results and why
differentiation is more susceptible to such errors than integration.

8. Evaluate an integral using Gaussian quadrature. You should be able to develop a
Legendre polynomial of degree n from the recursion formula and to tell what is
meant by orthogonal function.

9. Apply the various integration procedures to improper and indefinite integrals and
to multiple integrals.

10. Understand the principle of adaptive integration well enough to apply it to a simple
example.

11. Use a cubic spline fitted to a set of data to obtain the derivative and integral. You
should be able to explain why a spline may give improved accuracy.

12. Utilize the FORTRAN programs given in this chapter to obtain values for derivatives
and integrals.


COMPUTER PROGRAMS


We conclude this chapter with a series of computer programs. Programs 1 (Fig. 4.9) and
2 (Fig. 4.10) perform differentiation. The first is a subroutine named DERIV that finds 
the first derivative of tabular data with equispaced x-values. Parabolas are passed through 
groups of three points and the derivative is computed as the slope of the parabola at the 
point in question. For each interior point, this amounts to using a central-difference formula 
for the derivative of f(x). The endpoints use three-point formulas for the derivative, but 
the point at which the derivative is estimated is not a central point. The lozenge diagram 
of Fig. 4.1 is the source of these formulas; they have been rewritten in terms of function 
values in the routine. The subroutine is passed the values of the function in an array, the 
number of points, and the uniform x-interval. An array holding the derivative values is 
returned. 

When the tabular data given below were passed to the subroutine, the results in Fig. 
4.8 were obtained. The data are recordings of the speed of a car at 10-second intervals: 


Time, sec      0   10   20   30   40   50   60   70   80   90  100
Speed, mph    48   38   31   33   36   41   51   43   35   29   28


Figure 4.8  Program 1 output

Function Values and Derivative Values

Point #         F                 DF
   1       0.480000E 02     -0.115000E 01
   2       0.380000E 02     -0.850000E 00
   3       0.310000E 02     -0.250000E 00
   4       0.330000E 02      0.250000E 00
   5       0.360000E 02      0.400000E 00
   6       0.410000E 02      0.750000E 00
   7       0.510000E 02      0.100000E 00
   8       0.430000E 02     -0.800000E 00
   9       0.350000E 02     -0.700000E 00
  10       0.290000E 02     -0.350000E 00
  11       0.280000E 02      0.150000E 00
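For readers who want to check the output above, here is a minimal Python sketch of the same three-point scheme. It is not the book's DERIV listing; the function name `deriv` and the use of Python lists are assumptions, and the endpoint formulas are the usual three-point forward and backward expressions obtained from a parabola through the first or last three points.

```python
def deriv(f, h):
    """First derivative of equispaced data by three-point formulas.

    Interior points use the central difference; the two endpoints use the
    slope of the parabola through the nearest three points.
    """
    n = len(f)
    if n < 3:
        raise ValueError("at least three points are needed")
    df = [0.0] * n
    df[0] = (-3.0 * f[0] + 4.0 * f[1] - f[2]) / (2.0 * h)
    for i in range(1, n - 1):
        df[i] = (f[i + 1] - f[i - 1]) / (2.0 * h)
    df[n - 1] = (3.0 * f[n - 1] - 4.0 * f[n - 2] + f[n - 3]) / (2.0 * h)
    return df

# Speed of the car (mph) at 10-second intervals, from the table above.
speeds = [48, 38, 31, 33, 36, 41, 51, 43, 35, 29, 28]
slopes = deriv(speeds, 10.0)
for point, (value, slope) in enumerate(zip(speeds, slopes), start=1):
    print(point, value, slope)
```

The computed slopes are accelerations in mph per second and agree with the DF column of the output.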


Figure 4.9

SUBROUTINE DERIV(F,DF,N,H)


SUBROUTINE 


THIS SUBROUTINE FINDS THE FIRST DERIVAT. 
EQUISPACED DATA. CENTRAL DIFFERENCE FORMULAS ARE USED FOR ALL 
EXCEPT THE FIRST AND THE LAST. FOR THESE, A QUADRATIC IS PASSED 
THROUGH THREE SUCCESSIVE POINTS TO OBTAIN THE DERIVATIVE. 


PARAMETERS ARE : 


RETURNED TO 


2a” 
4 


AT X(2) 


GET IT AT X(N) 


DE(N) = ¢ 
RETURN 


END 


Program 1. 


In view of the discussion of "noisy data" in Section 4.6, what can you say about
the accuracy of these results? How could the accuracy of the acceleration determination 
be improved? 

The second program (DER2), in Fig. 4.10, is a subroutine that finds the first and
second derivatives of a function whose form is known. The name of the function is passed
as a parameter; it must therefore be declared EXTERNAL in the calling program. Central-
difference formulas are used, values of f(x) being used at the distance H on either side
of the point where the derivatives are desired. This routine and a modified version that
employed double-precision arithmetic were used to create the tables in Section 4.6 that
illustrate how the best accuracy in computing derivatives occurs at a step size where the
sum of round-off and truncation errors is minimized.

      SUBROUTINE DER2 (FCT,X,H,DX,DDX)
C
C     SUBROUTINE DER2 :
C        THIS SUBROUTINE COMPUTES THE FIRST AND SECOND
C     DERIVATIVES OF THE FUNCTION FCT AT THE POINT X. CENTRAL DIFFERENCE
C     FORMULAS ARE USED, EMPLOYING A STEP SIZE EQUAL TO H.
C
C     PARAMETERS ARE :
C
C     FCT - FUNCTION WHOSE DERIVATIVES ARE DESIRED. MUST BE DECLARED
C           EXTERNAL IN CALLING PROGRAM.
C     X   - POINT AT WHICH DERIVATIVES ARE TO BE CALCULATED.
C     H   - STEP SIZE TO BE USED.
C     DX  - RETURNS THE FIRST DERIVATIVE TO THE CALLER.
C     DDX - RETURNS THE SECOND DERIVATIVE TO THE CALLER.
C
      REAL FCT,X,H,DX,DDX,FO,FR,FL
C
      FO = FCT(X)
      FR = FCT(X+H)
      FL = FCT(X-H)
      DX = (FR - FL) / 2.0 / H
      DDX = (FR - 2.0*FO + FL) / H / H
      RETURN
      END

Figure 4.10 Program 2.
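The trade-off between truncation and round-off error can be seen with a small experiment. The following Python sketch mirrors the central-difference formulas of DER2; the name `der2`, the choice of sin x at x = 1, and the particular step sizes are assumptions made for illustration. Python floats are double precision, so the best step size differs from that of a single-precision FORTRAN run.

```python
import math

def der2(fct, x, h):
    """Central-difference estimates of the first and second derivatives."""
    f0 = fct(x)
    fr = fct(x + h)
    fl = fct(x - h)
    dx = (fr - fl) / (2.0 * h)
    ddx = (fr - 2.0 * f0 + fl) / (h * h)
    return dx, ddx

x0 = 1.0
errors = {}
for h in (1e-1, 1e-3, 1e-5, 1e-8):
    dx, ddx = der2(math.sin, x0, h)
    # Exact values: (sin x)' = cos x and (sin x)'' = -sin x.
    errors[h] = (abs(dx - math.cos(x0)), abs(ddx + math.sin(x0)))
    print(h, errors[h])
```

For large h the truncation error dominates; for very small h the subtraction of nearly equal function values is divided by a tiny h or h squared, and round-off takes over, most visibly in the second derivative.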

Four integration programs are presented. Program 3 (Fig. 4.11) is a subroutine named
SIMPS, designed to integrate uniformly spaced data that are known from a table. A
combination of Simpson's 1/3 and 3/8 rules is used when there is an odd number of intervals.
If there is an even number, only the 1/3 rule is used, which normally gives better accuracy.
An array containing the function values is passed, together with H = Δx and the number
of points. The program first checks to see whether the number of panels is odd or even.
If it is odd, one application of the 3/8 rule is performed to integrate over the first three
panels, and the 1/3 rule is applied over all the others. The parameter RESULT returns the
value of the integral. When this subroutine is applied to the velocity data tabulated with
Program 1, the result obtained is 3726.67 mile-sec/hr. If we unscramble the mixed time
units by dividing by 3600, we find that the distance traveled, as indicated by the
speedometer, is 1.035 mi.
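The panel logic just described can be sketched in Python as follows. This is an illustrative rewrite rather than the SIMPS listing itself: the name `simps` and the zero-based list indexing are assumptions.

```python
def simps(f, h):
    """Simpson integration of equispaced tabulated values.

    With an even number of panels only the 1/3 rule is used. With an odd
    number, the 3/8 rule covers the first three panels and the 1/3 rule
    covers the rest.
    """
    n = len(f)
    panels = n - 1
    if panels < 2:
        raise ValueError("at least two panels are needed")
    result = 0.0
    start = 0
    if panels % 2 == 1:
        result += 3.0 * h / 8.0 * (f[0] + 3.0 * f[1] + 3.0 * f[2] + f[3])
        start = 3
    for i in range(start, n - 2, 2):
        result += h / 3.0 * (f[i] + 4.0 * f[i + 1] + f[i + 2])
    return result

# Speed data (mph) at 10-second intervals, from the table given with Program 1.
speeds = [48, 38, 31, 33, 36, 41, 51, 43, 35, 29, 28]
distance = simps(speeds, 10.0)
print(distance)            # about 3726.67 mile-sec/hr
print(distance / 3600.0)   # about 1.035 miles
```

Both rules are exact for cubics, so a table of x**3 with an odd number of panels is a convenient check of the 3/8 branch.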

Program 4 (Fig. 4.12) is a subroutine that applies the Romberg method to integrate 
a function. The name of the function is passed as a parameter (thus it must be declared 
EXTERNAL), along with the limits of integration and a tolerance value to control the 
error. RESULT returns the final value to the caller, but intermediate results are printed 
out by the subroutine. The value of H is halved at each stage of the integration until TOL

      SUBROUTINE SIMPS(F,N,H,RESULT)
C
C     SUBROUTINE SIMPS :
C        THIS ROUTINE PERFORMS SIMPSON'S RULE INTEGRATION
C     OF A FUNCTION DEFINED BY A TABLE OF EQUISPACED VALUES.
C
C     PARAMETERS ARE :
C
C     F      - ARRAY OF VALUES OF THE FUNCTION
C     N      - NUMBER OF POINTS
C     H      - THE UNIFORM SPACING BETWEEN X VALUES
C     RESULT - ESTIMATE OF THE INTEGRAL THAT IS RETURNED TO THE CALLER
C
      REAL F(1024),H,RESULT
      INTEGER N,NPANEL,NHALF,NBEGIN,NEND
C
C     CHECK TO SEE IF NUMBER OF PANELS IS EVEN. NUMBER OF PANELS IS N - 1.
C
      NPANEL = N - 1
      NHALF = NPANEL / 2
      NBEGIN = 1
      RESULT = 0.0
      IF ( (NPANEL - 2*NHALF) .NE. 0 ) THEN
C
C     NUMBER OF PANELS IS ODD. USE 3/8 RULE ON FIRST THREE, 1/3 RULE ON REST.
C
         RESULT = 3.0*H/8.0*( F(1) + 3.0*F(2) + 3.0*F(3) + F(4) )
         NBEGIN = 4
         IF ( N .EQ. 4 ) RETURN
      END IF
C
C     APPLY 1/3 RULE - ADD IN FIRST, SECOND AND LAST VALUES.
C
      RESULT = RESULT + H/3.0*( F(NBEGIN) + 4.0*F(NBEGIN+1) + F(N) )
      NBEGIN = NBEGIN + 2
      IF ( NBEGIN .EQ. N ) RETURN
C
C     THE PATTERN AFTER NBEGIN+2 IS REPETITIVE. GET NEND, THE PLACE TO STOP.
C
      NEND = N - 2
      DO 10 I = NBEGIN,NEND,2
         RESULT = RESULT + H/3.0*( 2.0*F(I) + 4.0*F(I+1) )
   10 CONTINUE
      RETURN
      END

Figure 4.11 Program 3.
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PROGRAM PROMB (INPUT, OUTPUT) 


REAL FCT, TOL, TOTAL, A,B, RESULT 

INTEGER I 

EXTERNAL FCT 

DATA TOL,TOTAL,A,B / 0.0001,0.0,0.0,1.0 / 


DOO =e 
CALL ROMB (FCT,A,B, TOL, RESULT) 
TOTAL = TOTAL + RESULT 
PRINT 200, B,RESULT, TOTAL 
IF ( ABS(RESULT) .LT. 0.001 ) STOP 
A=B 


B = B*10 
10 CONTINUE 


200 FORMAT(/' B = ',F7.0,' INCREMENT IS ',E15.7,' INTEGRAL IS ',
+ E15.7//)


REAL FUNCTION FCT (X) 


REAL X 
FCT = X * EXP (-X) 


SUBROUTINE ROMB : 


SUBROUTINE FOR ROMBERG INTEGRATION. PROGRAM BEGINS 
WITH TRAPEZOIDAL INTEGRATION WITH 10 SUBINTERVALS. INTERVALS ARE THEN 
HALVED AND RESULTS ARE EXTRAPOLATED UP TO EIGHTH ORDER. MAXIMUM 
NUMBER OF SUBINTERVALS USED IN PROGRAM IS 2560. 


PARAMETERS ARE : 


FCN - FUNCTION THAT COMPUTES F(X), DECLARED EXTERNAL IN MAIN 
A,B - INITIAL AND FINAL X VALUES 

TOL - TOLERANCE VALUE USED TO TERMINATE ITERATIONS 

RESULT - RETURNS VALUE OF INTEGRAL TO CALLER 

TRAP - DOUBLY SUBSCRIPTED ARRAY THAT HOLDS INTERMEDIATE VALUES 


FOR COMPARISONS AND EXTRAPOLATION 
KFLAG - FLAG. USED INTERNALLY TO SIGNAL NON-CONVERGENCE. WHEN 
KFLAG=0, MEANS NON-CONVERGENT, =1 MEANS ALL OK. 


Figure 4.12 Program 4.


REAL FCN,A,B,TOL,H,SUM,X, TRAP (9,9) 
INTEGER I,L,K,KFLAG 


SET FLAG AT 1 INITIALLY 
KFLAG = 1 
COMPUTE FIRST INTEGRAL WITH 10 SUBINTERVALS AND USING TRAP RULE 


B- A) / 10.0 
FCN(A) + FCN(B) 


+ FCN(X)*2.0 


10 CONTINUE 
TRAP (1,1) = H / 2.0 * SUM 


RECOMPUTE INTEGRAL WITH H HALVED, EXTRAPOLATE AND TEST, REPEAT 
UP TO EIGHT TIMES. 


2 
+ FCN(X)*2.0 
+H 


30 CONTINUE 

TRAP(1,I+1) = H / 2.0 * SUM 

DO 40 L = 1,I

TRAP (L+1,I+1) = TRAP(L,I+1) + 1.0/(4.0**L - 1.0) * 
+ (TRAP (L,I+1) - TRAP(L,I)) 

40 CONTINUE 

IF ( ABS(TRAP(I+1,I+1) - TRAP(I,I+1)) - TOL ) 50,50,20 
20 CONTINUE 


IF TOLERANCE NOT MET AFTER 8 EXTRAPOLATIONS, PRINT NOTE & SET KFLAG = 0. 


KFLAG = 0 
PRINT 200 


PRINT INTERMEDIATE RESULTS 


52 i= 2+ 

po 70 L = 1,1 

PRINT 203, (TRAP (J,L),J=1,L) 

70 CONTINUE 

IF ( KFLAG .EQ. 0 ) STOP 

RESULT = TRAP (I,1) 
200 FORMAT(/’ TOLERANCE NOT MET. CALCULATED VALUES WERE ') 
203 FORMAT (1X, 8F12.6) 

RETURN 

END 


OUTPUT FOR PROGRAM 4

    .263408
    .264033    .264241
    .264189    .264241    .264241

B =      1.  INCREMENT IS   .2642411E+00  INTEGRAL IS   .2642411E+00

    .735878
    .735294    .735100
    .735260    .735249    .735259

B =     10.  INCREMENT IS   .7352591E+00  INTEGRAL IS   .9995002E+00

    .002044
    .001055    .000725
    .000661    .000530    .000517

B =    100.  INCREMENT IS   .5171925E-03  INTEGRAL IS

is met on that stage. The program avoids recomputing function values that have been 
used in prior stages. 

A main program and a function are also shown with Program 4, illustrating how the 
subroutine can be used to find an integral with infinite upper limit. The integral being 


evaluated is

$$\int_0^\infty x e^{-x}\,dx,$$


whose analytical result is unity. The routine, with TOL = 0.0001, finds a result of almost 
perfect accuracy. In the program, the upper limit is increased progressively until a neg- 
ligibly small increment is computed. 
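The strategy of the main program (integrate over [0, 1], then [1, 10], then [10, 100], and stop when the increment is negligible) can be sketched in Python. The function name `romberg`, its keyword defaults, and the cap of ten intervals in the driver loop are assumptions; the starting count of 10 subintervals, the limit of eight halvings, and the stopping tests follow the description of Program 4.

```python
import math

def romberg(f, a, b, tol, n0=10, max_halvings=8):
    """Romberg integration starting from n0 trapezoidal panels."""
    n = n0
    h = (b - a) / n
    trap = h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))
    previous = [trap]
    for level in range(1, max_halvings + 1):
        # Reuse old function values: only the new midpoints are evaluated.
        midpoints = sum(f(a + (i + 0.5) * h) for i in range(n))
        trap = 0.5 * (trap + h * midpoints)
        h /= 2.0
        n *= 2
        current = [trap]
        for k in range(1, level + 1):
            current.append(current[k - 1]
                           + (current[k - 1] - previous[k - 1]) / (4.0 ** k - 1.0))
        if abs(current[-1] - current[-2]) <= tol:
            return current[-1]
        previous = current
    raise RuntimeError("tolerance not met")

def integrand(x):
    return x * math.exp(-x)

total = 0.0
a, b = 0.0, 1.0
increments = []
for _ in range(10):                 # safety cap on the number of intervals
    increment = romberg(integrand, a, b, 1.0e-4)
    increments.append(increment)
    total += increment
    if abs(increment) < 0.001:      # negligibly small increment: stop
        break
    a, b = b, b * 10.0

print(increments, total)
```

The total comes out very close to the analytical value of unity, although the last increment is itself computed only roughly because the integrand is concentrated near the left end of [10, 100].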

The fifth program (see Fig. 4.13) is based on the adaptive Simpson method of Section
4.14. The example executed by this program is the very same one in that section. The
integrand is evaluated in FUNCTION F(X). SUBROUTINE ADAPT follows the
algorithm presented in Section 4.14 very closely. A Romberg refinement is made once the
integral value estimates, S1AB and S2AB, are sufficiently close. Moreover, as the limits
of integration are reduced, we also reduce the stopping criterion, TOL, by the same
factor. The PUSH and POP subroutines make the implementation of the algorithm
straightforward.
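A compact Python sketch of the same stack-based idea is given below. It is modeled on the steps visible in the ADAPT listing (push an interval, pop it, compare the one-panel-pair and two-panel-pair Simpson estimates, then either accept with the correction (S2AB - S1AB)/15 or push both halves with TOL halved). The name `adaptive_simpson`, the tuple layout, and the tolerance used in the example are assumptions.

```python
def adaptive_simpson(f, a, b, tol, max_count=10000):
    """Adaptive Simpson integration using an explicit stack of subintervals."""
    h1 = (b - a) / 2.0
    fa, fc, fb = f(a), f(a + h1), f(b)
    s1 = h1 * (fa + 4.0 * fc + fb) / 3.0
    # Each entry: left point, f(left), f(mid), f(right), half-width, tol, estimate.
    stack = [(a, fa, fc, fb, h1, tol, s1)]
    total = 0.0
    count = 0
    while stack and count < max_count:
        count += 1
        left, fl, fm, fr, half, local_tol, s1ab = stack.pop()
        h2 = half / 2.0
        mid = left + half
        fd = f(left + h2)
        fe = f(left + 3.0 * h2)
        s2ac = h2 * (fl + 4.0 * fd + fm) / 3.0
        s2cb = h2 * (fm + 4.0 * fe + fr) / 3.0
        s2ab = s2ac + s2cb
        if abs(s2ab - s1ab) < local_tol:
            # Accept, adding the extrapolation correction.
            total += s2ab + (s2ab - s1ab) / 15.0
        else:
            # Split: each half gets half the tolerance.
            stack.append((left, fl, fd, fm, h2, local_tol / 2.0, s2ac))
            stack.append((mid, fm, fe, fr, h2, local_tol / 2.0, s2cb))
    return total

# Integrand from the listing, F = 1/(X*X); limits 0.2 and 1.0 give the exact value 4.
answer = adaptive_simpson(lambda x: 1.0 / (x * x), 0.2, 1.0, 1.0e-3)
print(answer)
```

A Python list serves as the stack here, so separate PUSH and POP routines are not needed.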

The sixth program (DBLINT) (see Fig. 4.14) integrates a function of two independent 
variables over a rectangular region. The Romberg method again is used in the program, 
and the name of the function is passed as a parameter to provide greater flexibility. This 
program is similar to Program 4, except that a series of integrals in the x-direction is 
combined to integrate in the y-direction; a function subprogram is called to do this. 
Program 6 is less efficient than Program 4 in that some function values are recomputed 
as the step size is reduced. 
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PROGRAM SLOAST (IME CT, OPIS OT} 


TSS SSCCsM TASLAEMTS TEE SROTIVS SIMSON SVTSSasT Ie 
SSESEMTED IW SaCrams & 


HAAAanaAne 


= &,5,T0L, 2Swee 
DISS WANT, POOKY 
EIUSSNaL F 

COMMIT ETT, FONT 


Gata 2,5,T5L/0-Z, 2-0, 


c 

¢ : 
\OXIeT = 
Fomor = 
C255 £5057 (F, 2,5,T0, 

at SS2ST SSSESS = TE Ss 


SUSSOUTEsE Se 
e 
bod 
¢ 
c 


Toe , WET, Foe 
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Figure 4.13 (continued) 


COUNT = 0 

H1 = (B-A)/2.0
C = A + H1

FA = F(A) 
FB = F(B) 
FC = F(C) 

S1AB = H1*(FA + 4.0*FC + FB)/3.0 
TOP = 0 

CALL PUSH (A,FA,FC,FB,H1, TOL, S1AB) 


_ 


10 CONTINUE 
COUNT = COUNT + 1 
CALL POP(A,FA,FC,FB,H1,TOL, S1AB) 
H2 = H1/2.0 
D=A+ H2 
E =A + 3.0*H2 
C = A + H1
B=A + 2.0*H1 
FD = F(D) 
FE = F(E) 
S2AC = H2*(FA + 4.0*FD + FC)/3.0
S2CB = H2*(FC + 4.0*FE + FB)/3.0 


c 
S2AB = S2AC + S2CB 
c 
IF (ABS(S2AB - S1AB) .LT. TOL) THEN 
STORE = STORE + S2AB + (S2AB - S1AB)/15.0 
ELSE 
H1 = H2
TOL = TOL/2 
CALL PUSH(A,FA,FD,FC,H1,TOL, S2AC) 
CALL PUSH(C, FC, FE,FB,H1,TOL, S2CB) 
END IF 
c 
IF (TOP .GT. 0 .AND. COUNT .LT. MAXCNT) GO TO 10 
c 
c RETURN ANSWER 
c 
ANSWER = STORE 
RETURN 
END 
c 
OG sasccbeen took See i ee ees 
€ 
c SUBROUTINE PUSH WHICH STORES THE VALUES IN A STACK 
c 
(@ sseeeeeasseereetasske dhcseseceneate sass Soon ee ee 
c 
SUBROUTINE PUSH (LPT, LVAL,MIDVAL,RTVAL, STPSZE, TOL, INTGRL) 
c 
REAL LPT, LVAL, MIDVAL, RTVAL, STPSZE, TOL, INTGRL 
REAL STACK (7,100) 
INTEGER TOP 
COMMON /SUB/ STACK, TOP 
ra 


TOP = TOP + 1 
STACK(1,TOP) = LPT 


STACK(2,TOP) = LVAL
STACK(3,TOP) = MIDVAL
STACK(4,TOP) = RTVAL
STACK(5,TOP) = STPSZE


STACK(6,TOP) = TOL 
STACK(7,TOP) = INTGRL 


ee 


Figure 4.13 


aaanaaa 


aaaaaana 


(continued ) 


RETURN 
END 


SUBROUTINE POP (LPT, LVAL, MIDVAL, RTVAL, STPSZE, TOL, INTGRL)


REAL LPT, LVAL, MIDVAL, RTVAL, STPSZE, TOL, INTGRL 


REAL STACK (7,100) 
INTEGER TOP 
COMMON /SUB/ STACK, TOP 


LPT = STACK(1, TOP) 
LVAL = STACK (2, TOP) 
MIDVAL = STACK (3,TOP) 
RTVAL = STACK (4, TOP) 
STPSZE = STACK (5, TOP) 
TOL = STACK (6,TOP) 
INTGRL = STACK (7, TOP) 
TOP = TOP - 1 

RETURN 

END 


REAL FUNCTION F(X) 

REAL X 

INTEGER MAXCNT, FCNCNT 

COMMON MAXCNT, FCNCNT 
FCNCNT = FCNCNT + 1 
F = 1.0/(X*X) 

RETURN 

END 


OUTPUT FOR PROGRAM 5 


THE INTEGRAL VALUE IS 4.000060 


THE NUMBER OF FUNCTION CALLS

The seventh and final program of this section (see Fig. 4.15) implements the Gaussian
quadrature method. For simplicity, the subprogram GAUSSQ handles only 2, 3, or 4
points; the choice is determined by the value given to the variable NTERMS. It would be
easy to add T and W values for additional points in the IF (NTERMS .EQ. ) . . . statement.
Although we use only 4 points in this example program, the solution compares very well
with the exact integral, which is 0.6588234. It would be quite easy to extend this program
to two dimensions. The integrand is defined in FUNCTION F(X), and only the limits of
integration, A, B, and the number of terms, NTERMS, need to be input.
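As a cross-check on the quoted result, here is a short Python sketch of the same calculation. The dictionary `NODES` and the function name `gaussq` are illustrative assumptions; the nodes and weights are the standard Gauss-Legendre values to ten digits, and the limits 0.2 and 1.5 and the integrand exp(-x*x) are taken from the listing's DATA statement and FUNCTION F.

```python
import math

# Gauss-Legendre nodes and weights on [-1, 1] for 2, 3, and 4 points.
NODES = {
    2: ([-0.5773502692, 0.5773502692], [1.0, 1.0]),
    3: ([-0.7745966692, 0.0, 0.7745966692],
        [0.5555555556, 0.8888888889, 0.5555555556]),
    4: ([-0.8611363116, -0.3399810436, 0.3399810436, 0.8611363116],
        [0.3478548451, 0.6521451549, 0.6521451549, 0.3478548451]),
}

def gaussq(f, a, b, nterms):
    """Gaussian quadrature of f over [a, b] with 2, 3, or 4 points."""
    if nterms not in NODES:
        raise ValueError("nterms must be 2, 3, or 4")
    t, w = NODES[nterms]
    half_width = (b - a) / 2.0
    midpoint = (b + a) / 2.0
    total = sum(wi * f(half_width * ti + midpoint) for ti, wi in zip(t, w))
    return half_width * total

value = gaussq(lambda x: math.exp(-x * x), 0.2, 1.5, 4)
reference = math.sqrt(math.pi) / 2.0 * (math.erf(1.5) - math.erf(0.2))
print(value, reference)
```

The reference value is obtained from the error function and agrees with the exact integral quoted in the text.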
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SUBROUTINE DBLINT (FCT, XA, XB, YA, YB, TOL, RESULT) 
— 


SUBROUTINE DBLINT : 

THIS ROUTINE COMPUTES THE INTEGRAL OF A 
FUNCTION OF TWO VARIABLES. THE ROMBERG METHOD IS USED. INITIALLY, 
FOUR SUBDIVISIONS ARE USED. THESE ARE HALVED UNTIL THE TOLERANCE 
IS MET, WITH A MAXIMUM NUMBER OF SUBDIVIDINGS OF FIVE. 


FUNCTION SUBPROGRAM TO COMPUTE F(X,Y). DECLARED EXTERNAL 
IN CALLING PROGRAM. 

LOWER AND UPPER LIMITS FOR X. 

LOWER AND UPPER LIMITS FOR Y. 

TOLERANCE TO TERMINATE INTEGRATION. WHEN NOT MET, A 
MESSAGE IS PRINTED AND LAST VALUE RETURNED. 

RETURNS VALUE OF INTEGRAL TO CALLER. 

DOUBLY SUBSCRIPTED ARRAY TO HOLD INTERMEDIATE VALUES 
FOR COMPARISON AND EXTRAPOLATION. 


A FUNCTION SUBPROGRAM NAMED SUMROW, IS CALLED TO COMPUTE SUMS 
ACROSS ONE ROW OF THE REGION. 


c 
c 
c 
c 
c 
c 
Cc 
c 
c 
c 
c 
c 
c 
c 
c 
c 
S 
c 
c 
c 
c 
c 
c 


FCT, XA, XB, YA, YB, TOL, RESULT 
iN, o,K 
ARERE CEs 6) , DELX, DELY, SUMROW, Y 


DEL VALUES AND SUM TOP AND BOTTOM ROWS 


XA) / 4.0 
YA) / 4.0 


¥(1,1) SUMROW (FCT, XA, XB, YA, DELX, N) 
SUMROW (FCT, XA, XB, YB, DELX, N) 


1) + 2.0*SUMROW (FCT, XA, XB, Y, DELX,N) 


ARRAY (1,1) = ARRAY(1,1) * DELX * DELY / 4.0 


NOW HALVE THE VALUES OF DELX AND DELY, RECOMPUTE THE INTEGRAL, AND 
EXTRAPOLATE, THEN TEST TO SEE IF TOLERANCE IS MET. REPEAT UP TO 
FIVE TIMES. 


Figure 4.14 Program 6.


      DO 40 J = 1,5
         DELX = DELX / 2.0
         DELY = DELY / 2.0
         N = 2*N
C
C     DO TOP AND BOTTOM ROWS FIRST
C
         ARRAY(J+1,1) = SUMROW(FCT,XA,XB,YA,DELX,N) +
     +                  SUMROW(FCT,XA,XB,YB,DELX,N)
C
C     THEN THE INTERMEDIATE ROWS
C
         Y = YA
         DO 20 I = 2,N
            Y = Y + DELY
            ARRAY(J+1,1) = ARRAY(J+1,1) + 2.0*SUMROW(FCT,XA,XB,
     +                     Y,DELX,N)
   20    CONTINUE
         ARRAY(J+1,1) = ARRAY(J+1,1) * DELX * DELY / 4.0
C
C     NOW WE EXTRAPOLATE
C
         DO 30 K = 1,J
            ARRAY(J+1,K+1) = ARRAY(J+1,K) + 1.0 / (4.0**K - 1.0) *
     +                       ( ARRAY(J+1,K) - ARRAY(J,K) )
   30    CONTINUE
         IF ( ABS(ARRAY(J+1,J+1) - ARRAY(J+1,J)) - TOL ) 50,50,40
   40 CONTINUE


NOT MET, SO PRINT MESSAGE AND RETURN. 


PRINT 201, TOL 
RESULT = ARRAY (6,6) 


RETURN 
50 RESULT = ARRAY (J+1,J+1) 
RETURN 
201 FORMAT(/’ TOLERANCE OF ’,E£14.7,’ NOT MET AFTER FIVE ', 
+ ‘ EXTRAPOLATIONS.’ /) 
END 


REAL FUNCTION SUMROW (FCT, XA, XB, Y,DELX,N) 


FUNCTION SUMROW : 

THIS FUNCTION COMPUTES THE WEIGHTED SUM R 
TRAPEZOIDAL RULE INTEGRATION ACROSS ONE ROW OF A REGION, FROM 
XA TO XB WITH INTERVALS OF DELX, WHERE THE VALUE OF Y IS Y. 


PARAMETERS ARE : 




FCT - EXTERNAL FUNCTION THAT COMPUTES F (X,Y) 
XA,XB - LIMITS FOR*X VALUES 

Y - VALUE OF Y 

DELX - STEP SIZE FOR X 

N - NUMBER OF INTERVALS 


GET FIRST AND LAST VALUES TO START. 
SUMROW = FCT(XA,Y) + FCT (XB,Y) 


NOW ADD IN THE INTERMEDIATE VALUES. 


Q22Q AAAAAAAAAA 


SUMROW = SUMROW + 2.0 * FCT(X,Y)
CONTINUE 
RETURN 
END 


PROGRAM GAUSQDT (INPUT, OUTPUT) 


THIS PROGRAM IMPLEMENTS THE GAUSSIAN QUADRATURE METHOD AS PRESENTED 
IN SECTION 4.12. IT CAN HANDLE ONLY 2,3,4 TERMS DEPENDING ON THE 
PARAMETER, NTERMS. 


THIS PROGRAM SOLVES THE EXAMPLE IN SECTION 4.12. 


c 
c 
c 
Cc 
c 
c 
c 
Cc 
c 
c 
Cc 


REAL A,B, GAUSSQ 
EXTERNAL F 


DATA A,B,NTERMS/0.2,1.5,4/ 
PRINT '(//)! 


PRINT 100 , GAUSSQ(F,A,B,NTERMS) 
FORMAT (’THE VALUE OF THE INTEGRAL IS: ’, F10.7///) 


STOP 
END 
Cc 
Cc FUNCTION GAUSSQ(A,B,NTERMS) 
Cc 
REAL FUNCTION GAUSSQ(F,A,B,NTERMS) 
c 
Cc 
Cc 
c INPUT: A,B - INTERVAL [A,B] OVER WHICH FUNCTION IS TO BE 
c INTEGRATED. 
Cc NTERMS - NUMBER OF TERMS TO BE USED 
c F - THE INTEGRAND FUNCTION 


Figure 4.15 Program 7.


OUTPUT: GAUSSQ - THE VALUE OF THE INTEGRAL 


LOCAL VARIABLES: 


BPLUSA: (B + A)/2 

BLESSA: (B - A)/2 

T: THE VECTOR OF T-VALUES IN [-1,1] 
W: THE VECTOR OF WEIGHTS 


REAL A,B, T(4), W(4), BPLUSA, BLESSA, SUM 
INTEGER NTERMS, I,J 


      IF (NTERMS .EQ. 2) THEN
         T(1) = -0.5773502691
         T(2) = -T(1)
         W(1) = 1.0
         W(2) = 1.0
      ELSE IF (NTERMS .EQ. 3) THEN
         T(1) = -0.7745966691
         T(2) = 0.0
         T(3) = -T(1)
         W(1) = 0.55555555556
         W(2) = 0.88888888889
         W(3) = W(1)
      ELSE IF (NTERMS .EQ. 4) THEN
         T(1) = -0.8611363115
         T(2) = -0.3399810435
         T(3) = -T(2)
         T(4) = -T(1)
         W(1) = 0.3478548451
         W(2) = 0.6521451549
         W(3) = W(2)
         W(4) = W(1)
      END IF
C
      BLESSA = (B - A)/2.0
      BPLUSA = (B + A)/2.0
      SUM = 0.
      DO 10 I = 1,NTERMS
         SUM = SUM + W(I)*F(BLESSA*T(I) + BPLUSA)
   10 CONTINUE
C
      GAUSSQ = BLESSA * SUM
      RETURN
      END


DEFINE FUNCTION TO BE INTEGRATED 




REAL FUNCTION F(X) 


REAL X 

F = 1.0/EXP (X*X) 
RETURN 

END 


OUTPUT FOR PROGRAM 7 


THE VALUE OF THE INTEGRAL IS:  .6588291


EXERCISES 


Section 4.2 


▶1. The following table is for (1 + log x). Determine estimates of d(1 + log x)/dx at x = 0.15,
0.19, and 0.23, using (a) one term, (b) two terms, and (c) three terms of Eq. (4.9). By
comparing to the analytical values, determine the errors of each estimate. (Each row lists
the forward differences that begin at that x-value.)

   x      1 + log x      Δ         Δ²        Δ³
  0.15     0.1761      0.0543   -0.0059    0.0009
  0.17     0.2304      0.0484   -0.0050    0.0011
  0.19     0.2788      0.0434   -0.0039    0.0006
  0.21     0.3222      0.0395   -0.0033    0.0005
  0.23     0.3617      0.0362   -0.0027    0.0002
  0.25     0.3979      0.0335   -0.0025    0.0005
  0.27     0.4314      0.0310   -0.0020
  0.29     0.4624      0.0290
  0.31     0.4914


2. Write expressions for the errors in each computation of Exercise 1, by properly interpreting
Eq. (4.11). From these expressions find upper and lower bounds for each of the computations.

3. If one wished the derivative of (1 + log x) at x = 0.31, the table in Exercise 1 would be
inadequate for this, using Eq. (4.9), because the necessary forward differences cannot be
computed. A formula in terms of backward differences can be derived, however, by
differentiating the Newton-Gregory backward-interpolation polynomial. Do this, and use three
terms to obtain d(1 + log x)/dx at x = 0.31 from the data given in the table of Exercise 1.


4. Derive the error term for the formula of Exercise 3, and show that the "next-term rule"
applies.

5. A central-difference approximation for derivatives is more accurate than a forward- or
backward-difference approximation, as discussed in Section 4.2. Repeat Exercise 1(b), but use
Eq. (4.13) and compare the errors.

6. Find bounds on the errors of Exercise 5 and compare the actual errors to the bounds.

7. Exercises 1 through 6 used values from a table with equispaced x-values. The data below
are not equispaced, so Eq. (4.4) is needed to compute the derivative and errors are given by
Eq. (4.5). Use these to compute

a) f'(1.4) with one, two, three terms of Eq. (4.4).

b) f'(2.1) with one, two, three terms of Eq. (4.4).

c) Estimate the errors in parts (a) and (b).


v [us 18 19 21 


fix) 2.8867 3.4017 3.5355 3.7947 


Section 4.3 


8. The Newton-Gregory backward-interpolating formula can be developed by expanding the
symbolic relation

$$f_s = E^s f_0 = (1 - \nabla)^{-s} f_0.$$

Using this representation of the interpolation formula, derive the derivative formula required
in Exercise 3, by the symbolic technique.

9. Suppose the nature of the function tabulated in Exercise 1 were unknown. How could estimates
of the errors of each computation in that problem be determined? Compare with the error
bounds determined in Exercise 2.


Section 4.4 


10. If one begins with Stirling's interpolation formula (Section 3.5), differentiates, and then
evaluates at $x = x_0$, one obtains a derivative formula in terms of central differences. Show
that, when this is done, the coefficients are the same as those given by a horizontal path
through the lozenge diagram, Fig. 4.1, starting at $f_0$.

11. If one differentiates Stirling's formula twice and then sets s = 0, a central-difference formula
for the second derivative results. Show that the coefficients so obtained match those on a
horizontal path from $f_0$ in Fig. 4.3.

12. Make a lozenge diagram, similar to Figs. 4.1 and 4.3, for the third derivative.

13. If the difference table of Exercise 1 is extended to the right, differences up to the eighth order
can be calculated. The lozenge diagram, Fig. 4.1, can also be extended to a similar order of
differences. Discuss whether the accuracy of estimates of the derivative of the logarithm
function will be improved by such extensions.

14. An alternative way of deriving error terms for derivative formulas is through Taylor series
expansions. For example, the error term of Eq. (4.23) can be obtained by expanding
$f_1 = f(x_0 + h)$ and $f_{-1} = f(x_0 - h)$ each about the point $x = x_0$, and then combining the two
series. Show that the error term of Eq. (4.23) is $-\frac{1}{6}h^2 f'''(\xi)$.

15. For unevenly spaced data, a derivative expression analogous to Eq. (4.23) is

$$f'(x_0) \approx \frac{f(x_1) - f(x_{-1})}{x_1 - x_{-1}}.$$

In this case $x_1 - x_0 \ne x_0 - x_{-1}$. Take $x_1 - x_0 = h$ and $x_0 - x_{-1} = \alpha h$, $\alpha \ne 1$. Show by
the Taylor series method that the error is now $O(h)$ and not $O(h^2)$ unless $\alpha = 1$.

16. Show that the results of Exercise 15 are exactly what is given by divided-difference formulas.

Section 4.5
17. Use the extrapolation method to determine f'(0.23) for the data of Exercise 1, to an $O(h^4)$
extrapolation.

18. Show that the first-order extrapolation for $f'(x_0)$ with Δx-values differing by 2:1 is the same
as the formula

$$f'_0 = \frac{1}{h}\left(\frac{\Delta f_{-1} + \Delta f_0}{2} - \frac{1}{6}\,\frac{\Delta^3 f_{-2} + \Delta^3 f_{-1}}{2}\right),$$

where h is the smaller of the two Δx-values. Note that this comes directly from the lozenge
diagram.

19. Show that a second-order extrapolation for f’(x,) is the equivalent of using central differences 
through A*f. 

20. Consider whether extrapolation similar to that of Section 4.5 can be performed when deriv- 
atives for unevenly spaced data are estimated through divided differences. You may want to 
use Taylor series expansions as in Exercise 14. If you succeed in getting a formula, apply it 
to get f'(1.8) from the data in Exercise 7. 

Section 4.6 

21. The data below are for tan θ in the neighborhood of θ = 1 radian.

▶a) Truncating the data to four significant digits, compute the derivative at θ = 1.0, using a
central-difference formula (Eq. (4.8)) but with varying step size. At what point is the
greatest accuracy obtained?

▶b) Repeat part (a) but with the data rounded to four digits.

c) Repeat with the data as given below, which are rounded to six digits.

   θ        tan θ         θ        tan θ
 0.9000    1.26016      1.0001    1.55775
 0.9900    1.52368      1.0010    1.56084
 0.9990    1.55399      1.0100    1.59221
 0.9999    1.55706      1.1000    1.96476
 1.0000    1.55741

22. Repeat Exercise 21, but use forward- and backward-difference formulas.

23. Repeat Exercise 21, but use a central-difference formula for second derivatives.

24. Use Program 2 (Fig. 4.10) to determine the optimum step size for most accurate
computation of first and second derivatives for a number of functions, including sin x, cos x,
sinh x, cosh x, ln x. If your computer has double precision in FORTRAN (also if extended
precision is available), modify the program to investigate for these cases.
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Section 4.7 


25. Equations (4.36) through (4.41) are Newton-Cotes formulas, and their error terms are derived
from forward-difference interpolating polynomials. Rederive these, beginning with
backward-difference formulas.

▶26. Prove the assertion made at the end of Section 4.7 that all Newton-Cotes integration formulas
of even order have an order of h in the error term one greater than would normally be expected.

27. Derive the Newton-Cotes formulas of orders 4 and 5.

28. Beginning with the divided-difference form of the interpolating polynomial, rederive the
Newton-Cotes formulas of orders 1, 2, and 3. (Assume, of course, that $(x_{i+1} - x_i) = h$ for
all i.)

29. Continue Exercise 28 to derive the error terms.

Section 4.8 

»30, The following values of a function are given. 
Find [| §f(x) dx, using the trapezoidal rule with 
abh=O1 b) h = 0.2 co) h=04 

31. The function tabulated in Exercise 30 is cosh x. What are the errors of the computations in 
parts (a), (b), (c)? How closely are the errors proportional to A?? What other errors besides 
that of Eq. (4.45) are present? 

32. In Section 4.8, the value of $\int_{1.8}^{3.4} e^x\,dx$ is computed as 23.9944 using the
trapezoidal rule with h = 0.2, and is in error by -0.08. Equation (4.45) shows this to fall
within the predicted bounds for the error. Suppose we did not know what f(x) was; estimate the
error using second-order differences of the data in Table 4.3.

»33. If one wished to compute f} $e" dx using the trapezoidal rule, and wished to be certain of 
five-decimal accuracy (|error| < 0.000005), how small must h be?

34. Find $\int f(x)\,dx$:


x       f(x)      x       f(x)
0       1.0000    1.08    0.3396
0.12    0.8869    1.43    0.2393
0.53    0.5886    2.00    0.1353
0.87    0.4190
Section 4.9
35. Extrapolate the individual answers of Exercise 30 to get estimates of improved accuracy. 
What are the orders of the errors of these extrapolations? 
»36. Use extrapolation to the limit to evaluate $\int f(x)\,dx$, getting a result with error $O(h^6)$:
x       f(x)      x       f(x)
0       0.3989    0.50    0.3521
0.25    0.3867    0.75    0.3011
                  1.00    0.2420
37. Use Romberg integration (successive extrapolations with h halved each time) to evaluate
$\int_1^2 dx/x$. Carry six decimals and continue until no change in the fifth place occurs.
Compare to the analytical value ln 2 = 0.69315.


Section 4.10 


38. 
39, 


»40. 


43. 
m4. 


45. 


Repeat Exercise 30 using Simpson's 1/3 rule with (a) h = 0.1, (b) h = 0.2, (c) h = 0.4.


Use the error expression of Eq. (4.49) to estimate the maximum and minimum errors to be 
expected in Exercise 38. (f(x) in Exercise 30 is cosh x.) 


Evaluate f},° e* dx by Simpson’s } rule, choosing h small enough to guarantee five-decimal 
accuracy. How large can h be?


In the text it is stated that the results of using Simpson's 1/3 rule and the extrapolated result
from two applications of the trapezoidal rule (with h changed by 2:1) are identical. Show
that this will always be true.


Compute 

' sin x 

10 x 
using Simpson's 1/3 rule with h = 0.5 and with h = 0.25. Extrapolate the results. What is the
order of the error for the extrapolated result?
Repeat Exercise 40, but use Simpson's 3/8 rule.
Apply a combination of the two Simpson rules to evaluate the integral from x = 3 to x = 
6.5. Note that this is a technique to handle an odd number of panels: 


x      f(x)        x      f(x)
3.0     0.33906    5.0    -0.32758
3.5     0.13738    5.5    -0.34144
4.0    -0.06604    6.0    -0.27668
4.5    -0.23106    6.5    -0.15384


The fact that Simpson's 1/3 rule has an error term dependent on the fourth derivative and not
the third (as is also true for the 3/8 rule) means that these integrations are exact for any cubic
polynomial. However, the 1/3 rule is based on fitting a quadratic through three equispaced
points. The implication of this is that the area under any cubic, from x = a to x = b, is
exactly the same as the area under the parabola that intersects the cubic at the two endpoints,
and also at x = (a + b)/2. Prove this.


Section 4.11 
»46. Use the symbolic method to determine a formula for three-panel integration, using a poly- 
nomial of degree m. 

47. Integration from $x_0$ to $x_1$, as in Eq. (4.51), is an arbitrary choice taken for convenience only.
If the limits were taken from $x_{-1}$ to $x_0$, a similar formula for one-panel integration would
result, but different coefficients would be obtained. Carry out this computation.

48. Perform a computation similar to Exercise 47, but get a two-panel integration formula from
$x_{-1}$ to $x_1$.

49, Use the method of undetermined coefficients to derive (a) the trapezoidal rule, (b) Simpson's 
3 rule 

50. Use the method of undetermined coefficients to derive the central-difference formulas, Eqs.
(4.23) and (4.24), for $f'(x_0)$ and $f''(x_0)$.


Section 4.12 


eS. 


53; 


Evaluate 


jo x 


by a three-term Gaussian quadrature formula. 

By computing with Gaussian quadrature formulas of increasing complexity, determine how
many terms are needed to evaluate $\int_{1.8}^{3.4} e^x\,dx$ to five-decimal accuracy. (The exact
value of the integral is 23.9144526.)

An n-term Gaussian quadrature formula assumes that a polynomial of degree 2n - 1 is used
to fit the function between x = a and x = b. Does this mean that the error term is the same
as for the Newton-Cotes integration formulas based on polynomials of degree 2n - 1?


Section 4.13 


34. 


»56. 


As discussed in Section 4.13, an integral with infinite limits can be approximated numerically
if it is convergent. For example, the exponential integral Ei(x) can be evaluated by taking
the upper limit U sufficiently large in

$$\mathrm{Ei}(x) = \int_x^{\infty} \frac{e^{-v}}{v}\,dv \approx \int_x^{U} \frac{e^{-v}}{v}\,dv.$$

We know that U is “sufficiently large” when the additional contributions of making U larger
are negligible. Estimate Ei(0.5) using the trapezoidal rule. Note that one can use larger
subintervals as v increases. Compare to the tabular value of 0.5598.


Apply the method of Section 4.13 to show that 


Pe alk 
| eH? dy = 1, 
ae 2 


(The values of the integrand are available as ordinates of the standard normal curve.) 


Evaluate these integrals 


2 dx b) ('__ de ° x dk 
# I — ) [ oh I a (Analytical value is 77/12) 
0 Vx eS Ug Ce ale oat | 


57.


>58. 


59. 


In integrating $\int_a^b f(x)\,dx$, Gaussian quadrature does not require the value of f(x) at the
endpoints. Therefore that technique has special utility when there are singularities at the
endpoints of the interval of integration. Apply this to Exercise 56 (a) and (b).


It often helps to break up an integral into parts. Consider

$$\int_0^{\infty} \frac{dx}{e^x + e^{-x}}.$$

At x = 10, $e^{-x}$ is only $4 \times 10^{-5}$, while $e^x$ is 22,026.5; for x > 10, the integral
can be considered to be just $\int dx/e^x$. Evaluate the integral numerically as

$$\int_0^{10} \frac{dx}{e^x + e^{-x}} + \int_{10}^{\infty} \frac{dx}{e^x}.$$


One must be cautious in integrating on a computer. fj, dx/x surely diverges, but you will 
probably not find this to be true when calculating numerically because the small values of 
delta x near x = 0 may cancel the contribution to the total. It will depend on how one defines 
the step size as x gets small. See what you get as a result of numerically evaluating this 
integral. 


Section 4.14 


60. 


6l. 


a) Use a calculator to verify the results in the example of Section 4.14. 

b) Redo the example of Section 4.14 using an adaptive trapezoidal method. How many 
function evaluations would be required? Compare this number with the number required 
if you did not use an adaptive method. 


Most programs will compute the appropriate step size h in the manner described in the
discussion of adaptive integration. However, this could lead to significant errors. For instance,

$$\int_0^{\pi/2} \sin^2(16x)\,dx = \frac{\pi}{4},$$

but it is easy to see that $S_1(0, \pi/2) = S_2(0, \pi/2) = 0$, where $h_1 = \pi/4$ and $h_2 = \pi/8$.
How should one solve this problem correctly using the adaptive method described in this
section? (It is interesting to see that the integration function on the HP-15C avoids this error.)
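
The vanishing Simpson estimates mentioned above can be checked numerically. The following is only a minimal sketch in Python (not the language of the book's programs); the helper name `simpson` and the choice of 64 panels for the finer grid are assumptions made for illustration.

```python
import math

def simpson(f, a, b, panels):
    """Composite Simpson's 1/3 rule on an even number of equal panels."""
    if panels % 2:
        raise ValueError("Simpson's 1/3 rule needs an even number of panels")
    h = (b - a) / panels
    total = f(a) + f(b)
    for i in range(1, panels):
        # Interior nodes alternate between weight 4 (odd i) and weight 2 (even i).
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def integrand(x):
    return math.sin(16 * x) ** 2

s1 = simpson(integrand, 0.0, math.pi / 2, 2)    # h = pi/4
s2 = simpson(integrand, 0.0, math.pi / 2, 4)    # h = pi/8
s64 = simpson(integrand, 0.0, math.pi / 2, 64)  # h = pi/128, hypothetical finer grid
exact = math.pi / 4

print(s1, s2, s64, exact)
```

Every node of the two coarse grids falls on a zero of the integrand, so both estimates agree with each other while missing the true value entirely; an error test based only on their difference cannot notice this.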


Section 4.15 


>62, 


63. 


Consider this table of data (which are obviously for f(x) = 1/x).


x       1.0      1.5      2.0      2.5      3.0
f(x)    1.000    0.667    0.500    0.400    0.333


Find values for f'(x) and f''(x) at x = 1.5, 2.0, and 2.5 from the cubic spline functions that
approximate f(x). Compare to analytical values for f'(x) and f''(x) to determine their errors.
Also compare to derivative values computed from central-difference formulas.


The comparison in Exercise 62 between the spline and central-difference values may favor 
the spline method because it is based on approximation by a cubic while the central-difference 
formulas approximate f(x) with a quadratic. Suppose we fit the function by polynomials of 
degree 3 and 4. How do estimates of f'(x) and f''(x) from these approximations compare to
those from the cubic splines?

The natural spline in Exercise 62 is handicapped by setting the second derivatives to zero at 
the endpoints, while the correct values are 2 and 2/27. Repeat Exercise 63 for a spline that 
has the correct values for its second derivative at x = 1 and x = 3, 


»65. Find $\int_0^2 \operatorname{sech} x\,dx$ by integrating the natural cubic spline that fits at
five equispaced points on [0, 2]. Compare to the analytical value; also compare to the
Simpson's 1/3 rule value.

66. In Exercise 65, the natural spline assumes that the second derivatives at the extremes of the
interval are zero. How does the value of the integral change when conditions 2, 3, and 4 are
employed?

67. The best values to use in a cubic spline for the values of its second derivatives at the
endpoints should be the correct values of f''(x). Repeat Exercise 65, using the correct values
for f''(x) where f(x) = sech x.


Section 4.16 


68, 


69, 


70 


Ts 


>72. | 


In connection with the first example of Section 4.16, it was stated that it is immaterial which 

integral of a double integral (with constant limits) we integrate first. Confirm this by evaluating 

SERS F(x, y) dy dx, with fOc, y) given by Table 4.8, performing the integration first with 

respect to y, 

Write the pictorial operators similar to that portrayed in Eq. (4.61) that result by using:

a) Simpson's 1/3 rule in the y-direction and the trapezoidal rule in the x-direction;

b) Simpson's 1/3 rule in both directions;

c) Simpson's 3/8 rule in both directions.

d) What conditions are placed on the number of panels in each direction by the methods
employed in parts (a), (b), and (c)?

Since Simpson's 1/3 rule is accurate for a cubic, evaluation of the triple integral below using
this rule should be exact. Confirm this by evaluating both numerically and analytically:


rip pt 
| [ I xyz? dx dy dz. 
lo Jo Jo 


Use Eq. (4.65) adapted to this case. 


Draw a pictorial operator (three dimensions) to represent the formula used in Exercise 70. It 
is perhaps easiest to do this with three widely separated planes on which one indicates the 
coefficients (see illustration). 


Evaluate 


07 (06 
[ | e* sin y dy dx 
0.4 J-.2 


a) Using the trapezoidal rule in both directions, Δx = Δy = 0.1.
b) Using Simpson's 1/3 rule in both directions, Δx = Δy = 0.1.
c) Using Gaussian quadrature, three-term formulas in both directions.


Section 4.18 


B. 


Solve Exercise 72 by extrapolating from the trapezoidal-rule evaluations using Δx =
Δy = 0.2, and Δx = Δy = 0.1. This should give the same result as part (b) of Exercise 72.
Does it?


74. Integrate [3f3 (> + 3°) dx dy using the trapezoidal rule with varying Δx and Δy. Show that
the errors decrease proportionately to $h^2$.


Section 4.19 
»75. Integrate $\iint \sin x \sin y\,dx\,dy$ over the region defined by that portion of the unit circle that
lies in the first quadrant. Integrate first with respect to x holding y constant, with Δx = 0.25.
2) Use the trapezoidal rule. b) Use Simpson's 3 rule. 
76. The order of integration in multiple integration may usually be changed. Evaluate the integral
of Exercise 75 by integrating first with respect to y, holding x constant. The integral then
becomes

$$\int_0^1 \int_0^{\sqrt{1 - x^2}} \sin x \sin y\,dy\,dx.$$


»77. Integrate the function ¢~**" over the region bounded by the two parabolas $y = x^2$ and
$y = 2x^2 - 1$. Note that the integrand is an even function and that the region is symmetrical about
the y-axis, so that the integral over half the region may be evaluated and then doubled. Choose
reasonable values for Δx and Δy.

78. a) Use Gaussian quadrature to solve Exercise 75. Use three-point formulas. 
b) Solve Exercise 77 using four-point Gaussian quadrature formulas. 


APPLIED PROBLEMS AND PROJECTS


79. Differential thermal analysis is a specialized technique that can be used to determine transition
temperatures and the thermodynamics of chemical reactions. It has special application in the
study of minerals and clays. Vold (Anal. Chem. 21, 683 (1949)) describes the technique. In
this method, the temperature of a sample of the material being studied is compared to the
temperature of an inert reference material when both are heated simultaneously under identical
conditions. The furnace housing the two materials is normally heated so that its temperature
$T_f$ increases (approximately) linearly with time (t), and the difference in temperatures (ΔT)
between the sample and the reference is recorded. Some typical data are


1, min 0 1 2 a 4 3 6 7 


AT, °F | 0.00 ox 1.86 4.32 8.07 13.12 16.80 18.95 


‘he 


TF | 82 88 94 90 27 943 99 975 
e1 e 9 10 rT 2 B “4 15 16 
| 18.07 1669 1525 1386 1258 140 1033 895 6.46 
T, | 992 1008 123 1039 1055 1071 1086 1102 i118 
: | 7 18 19 20 2 2 3B Ey 2s 
ar| 465 337 24 176 12% O88 06 042 030 
71135 M51 68 N84 120 1216 132 1239 1265 


The ΔT values increase to a maximum, then decrease. This is due to the heat evolved in an
exothermic reaction. One item of interest is the time (and furnace temperature) when the
reaction is complete. Vold shows that the logarithm of ΔT should decrease linearly after the
reaction is over; while the chemical reaction is occurring, the data depart from this linear
relation. She used a graphical method to find this point. Perform numerical computations to
find, from the above data, the time and the furnace temperature when the reaction terminates.
Compare the merits of doing it graphically or numerically.

80. The temperature difference data in Problem 79 can be used to compute the heat of reaction.
To do this, the integral of the values of ΔT is required, from the point where the reaction
begins (which is at the point where ΔT becomes nonzero) to the time when the reaction
ceases, as found in Exercise 79. Determine the value of the required integral. Which of the
methods of this chapter should give the best value for the integral?

81. Fugacity is a term used by engineers to describe the available work from an isothermal
process. For an ideal gas, the fugacity f is equal to its pressure P, but for real gases,

$$\ln \frac{f}{P} = \int_0^P \frac{C - 1}{P}\,dP,$$

where C is the experimentally determined compressibility factor. For methane, values of C
are

P (atm) P (atm) Cc 
1 0.9940 80. 0.3429 
10 0.9370 120 0.4259 
20 0.8683 | 160 0.5252 
40 0.7043 250 0.7468 
60 oasis | 400 1.0980 


Write a program that reads in the P and C values and uses them to compute and print f
corresponding to each pressure given in the table. Assume that the value of C varies linearly
between the tabulated values (a more precise assumption would fit a polynomial to the tabulated
C values). The value of C approaches 1.0 as P approaches 0.

82. The stress developed in a rectangular bar when it is twisted can be computed if one knows
the values of a torsion function U that satisfies a certain partial-differential equation. Chapter
7 describes a numerical method that can determine values of U. To compute the stress, it is
necessary to integrate $\iint U\,dx\,dy$ over the rectangular region for which the data given below
apply. (You may be able to simplify the integration because of the symmetry in the data.)
83. Make a critical comparison of the accuracy of Newton-Cotes integration formulas compared
to Gaussian quadrature. Test the formulas for a variety of functions for which you can calculate
the integrals analytically. Select some functions that are smooth, some that have sharp changes
in value, and some with periodic behavior.

84. Write a general-purpose subroutine that performs Gaussian quadrature. You should have the
subroutine change the limits of the integration appropriately and call a function subprogram
to compute function values, with the name of the function subprogram passed as an argument.
It should also receive as an argument the degree of the formula to be employed.

85. The data in Exercise 62 exhibit round-off in the second and in the last entries. By recalculating
the spline function, with more and less accurate values of f(1.5) and f(3.0), determine how
the values of S are affected by the precision of the data. Also calculate how the precision
affects the estimates of f'(x) and f''(x). Compare the effects of the precision of the original
data, using a cubic spline, with the effects of precision on the values of f'(x) and f''(x) when
central-difference formulas are used.


Numerical Solution of 
Ordinary Differential 
Equations 


5.0 CONTENTS OF THIS CHAPTER 


Ordinary differential equations describe many physical situations: spring—mass 
systems, resistor—capacitor—inductance circuits, bending of beams, chemical 
reactions, pendulums, and so on. Their prominence in applied mathematics is 
due to the fact that most scientific laws are more readily expressed in terms of 
rates of change. For example, 


$$\frac{du}{dt} = -0.27(u - 60)^{5/4}$$


is an equation describing (approximately) the rate of change of temperature u 
of a body losing heat by natural convection with constant-temperature surround- 
ings. This is termed a first-order differential equation because the highest-order 
derivative is the first. 

If the equation contains derivatives of nth order, it is said to be an nth-order
differential equation. For example, a second-order equation describing the oscillation
of a weight acted upon by a spring, with resistance to motion proportional
to the square of the velocity, might be

$$\frac{d^2x}{dt^2} + 4\left(\frac{dx}{dt}\right)^2 + 0.6x = 0,$$

where x is the displacement and t is time.

The solution to a differential equation is the function that satisfies the dif- 
ferential equation and that also satisfies certain initial conditions on the function. 
In solving a differential equation analytically, one usually finds a general solution 
containing arbitrary constants and then evaluates the arbitrary constants so that 
the expression agrees with the initial conditions. For an nth-order equation, n 
independent initial conditions must usually be known. The analytical methods 
are limited to certain special forms of equations; elementary courses normally 
treat only linear equations with constant coefficients when the degree of the 
equation is higher than the first. Neither of the above examples is linear. 

Numerical methods are not limited to such standard forms. We
obtain the solution as a tabulation of the values of the function at various values
of the independent variable, however, and not as a functional relationship. We
must also pay a price for our ability to solve practically any equation, in that we
must recompute the entire table if the initial conditions are changed.

Our procedure will be to explore several methods of solving first-order 
equations, and then to show how these same methods can be applied to systems 
of simultaneous first-order equations and to higher-order differential equations. 
We will use for our typical first-order equation the form 


$$\frac{dy}{dx} = f(x, y), \qquad y(x_0) = y_0.$$


5.1 POPULATION CHARACTERISTICS OF FIELD MICE
Outlines a typical problem from the field of biology/ecology that involves a
differential equation.

5.2 TAYLOR-SERIES METHOD
Is a straightforward adaptation of classical calculus to develop the solution as an
infinite series. The catch is that a computer usually cannot be programmed to
construct the terms and one doesn't know how many terms should be used.

5.3 EULER AND MODIFIED EULER METHODS
Are simple to use but subject to error unless the step size Δx is made very small.

5.4 RUNGE-KUTTA METHODS
Are very popular because of their good efficiency; these are used in most computer
programs for differential equations. These are single-step methods, as are the Euler
methods. In this section we compare the methods presented so far.

5.5 MULTISTEP METHODS
Are even more efficient than the previous methods but cannot be used at the
beginning of the interval of integration; they can be employed after several steps
with a single-step method. These methods have a built-in ability to monitor the
error in the solution.

5.6 MILNE'S METHOD
Is a multistep method that appears very attractive until one finds that it may be
unstable.

5.7 ADAMS-MOULTON METHOD
Is a multistep method that does not suffer from the fault of instability. This section
compares the methods of Adams-Moulton and Milne for accuracy and stability
through an example.

5.8 CONVERGENCE CRITERIA
Impose additional limitations on the step size beyond the requirement for accuracy;
here you are exposed to some theory.

5.9 ERRORS AND ERROR PROPAGATION
Examines the various sources of error in the context of solving differential
equations; the important concept of propagated error is examined in a simple case,
again giving you a taste of theoretical numerical analysis.

5.10 SYSTEMS OF EQUATIONS AND HIGHER-ORDER EQUATIONS
Are the usual situations in applied problems; fortunately our methods are readily
extended to cover them. The various methods are again applied to the same example
to show how they compare.

5.11 COMPARISON OF METHODS
Summarizes the various methods and makes a critical comparison of their strengths
and weaknesses. The somewhat esoteric notion of “stiff” equations is examined.

5.12 CHAPTER SUMMARY
Is our usual review section.

5.13 COMPUTER PROGRAMS
Gives programs for modified Euler and fourth-order Runge-Kutta procedures that
can be applied to single equations, systems, or higher-order equations.
5.1 POPULATION CHARACTERISTICS OF FIELD MICE


An ecologist has been studying the effects of the environment on the population of field
mice. Her research shows that the number of mice born each month is proportional to
the number of females in the group, and that the fraction of females is normally constant
in any group. This implies that the number of births per month is proportional to the
total population.

She has located a test plot for further research, which is a restricted area of semiarid
land. She has constructed barriers around the plot so mice cannot enter or leave. Under
the conditions of the experiment, the food supply is limited, and it is found that the death
rate is affected as a result, with mice dying of starvation at a rate proportional to some
power of the population. (She also hypothesizes that when the mother is undernourished,
the babies have less chance for survival, and that starving males tend to attack each other,
but these factors are only speculation.)

The net result of this scientific analysis is the following equation, with N being the
number of mice at time t (with t expressed in months). She has come to you for help in
solving the equation; her calculus doesn't seem to apply.

$$\frac{dN}{dt} = aN - BN^{1.7},$$

with B given by Table 5.1.

As the season progresses, the amount of vegetation varies. She accounts for this change
in the food supply by using a “constant” B that varies with the season.

If 100 mice were initially released into the test plot and if a = 0.9, estimate the
number of mice as a function of t, for t = 0 to t = 8.

In this chapter we will find how such a problem can be solved by numerical methods.


Table 5.1

t    B         t    B
0    0.0070    5    0.0013
1    0.0036    6    0.0028
2    0.0011    7    0.0043
3    0.0001    8    0.0056
4    0.0004


5.2 TAYLOR-SERIES METHOD


The first method we discuss is not strictly a numerical method, but it is sometimes used
in conjunction with the numerical schemes, is of general applicability, and serves as an
introduction to the other techniques we will study. Consider the example problem

$$\frac{dy}{dx} = x + y, \qquad y(0) = 1. \tag{5.1}$$

(This particularly simple example is chosen to illustrate the method so that you can readily
check the computational work. The analytical solution, $y = 2e^x - x - 1$, is obtained
immediately by application of the standard methods, and will be compared with our
numerical results to show exactly the error at any step.)

We develop the relation between y and x by finding the coefficients of the Taylor
series in which we expand y about the point $x = x_0$:

$$y(x) = y(x_0) + y'(x_0)(x - x_0) + \frac{y''(x_0)}{2!}(x - x_0)^2 + \frac{y'''(x_0)}{3!}(x - x_0)^3 + \cdots$$

If we let $x - x_0 = h$, we can write the series as

$$y(x) = y(x_0) + y'(x_0)h + \frac{y''(x_0)}{2!}h^2 + \frac{y'''(x_0)}{3!}h^3 + \cdots$$


Since $y(x_0)$ is our initial condition, the first term is known from the initial condition
y(0) = 1. (Since the expansion is about the point x = 0, our Taylor series is
actually a Maclaurin series in this example.)

We get the coefficient of the second term by substituting x = 0, y = 1 into the
equation for the first derivative, Eq. (5.1):

$$y'(x_0) = y'(0) = 0 + 1 = 1.$$


We get equations for the second- and higher-order derivatives by successively
differentiating the equation for the first derivative. Each of these is evaluated corresponding
to x = 0 to get the following:

$$\begin{aligned}
y'' &= 1 + y', & y''(0) &= 1 + 1 = 2,\\
y''' &= y'', & y'''(0) &= 2,\\
y^{iv} &= y''', & y^{iv}(0) &= 2,\\
&\text{and so on,} & y^{(n)}(0) &= 2.
\end{aligned}$$


(Getting the derivatives is deceptively easy in this example. You should compare this to 
the function f(x, y) = x/y, with y(1) = 1.) 

We then write our series solution for y, letting x = h be the value at which we wish
to determine y:

$$y(h) = 1 + h + h^2 + \tfrac{1}{3}h^3 + \tfrac{1}{12}h^4 + \text{error}.$$

The solution to our differential equation, dy/dx = x + y, y(0) = 1, is then given
by Table 5.2.

As we computed the last two entries, there was some doubt in our minds as to their 
accuracy without using more terms of the Taylor series, because the successive terms 
were decreasing less and less rapidly. We need more terms than we have calculated to 
get four-decimal-place accuracy. 
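
The tabulated values can be reproduced with a few lines of code. This is a minimal sketch in Python rather than the FORTRAN used for the book's programs; the function names `taylor_y` and `exact_y` are introduced here purely for illustration.

```python
import math

def taylor_y(h):
    """Five-term Taylor (Maclaurin) series for dy/dx = x + y, y(0) = 1.

    The coefficients y(0), y'(0), y''(0)/2!, y'''(0)/3!, y''''(0)/4! are
    1, 1, 1, 1/3 and 1/12, as derived in the text.
    """
    coefficients = [1.0, 1.0, 1.0, 1.0 / 3.0, 1.0 / 12.0]
    return sum(c * h**k for k, c in enumerate(coefficients))

def exact_y(x):
    """Analytical solution y = 2 e^x - x - 1."""
    return 2.0 * math.exp(x) - x - 1.0

for i in range(6):
    x = i / 10
    print(f"{x:.1f}  {taylor_y(x):.4f}  {exact_y(x):.4f}  {exact_y(x) - taylor_y(x):.1e}")
```

The last column shows the truncation error growing quickly with x, which is the source of the doubt about the entries at x = 0.4 and x = 0.5.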

Table 5.2

x      y             y, analytical
0      1.0000        1.0000
0.1    1.1103        1.1103
0.2    1.2428        1.2428
0.3    1.3997        1.3997
0.4    1.5835 (?)    1.5836
0.5    1.7969 (?)    1.7974


The error when a convergent Taylor series is truncated is simple to express. We
merely take the next term and evaluate the derivative at the point $x = \xi$, $0 < \xi < h$,
instead of at the point $x = x_0$. This is exactly what we did to write error terms in the
previous chapters. The error term of the Taylor series after the $h^4$ term is

$$\text{Error} = \frac{y^{(v)}(\xi)}{5!}\,h^5, \qquad 0 < \xi < h.$$

However, this cannot be computed because evaluating the derivative at $x = \xi$ is impossible
with $\xi$ unknown, and even bounding it in the interval [0, h] is impossible because the
derivatives are known only at x = 0 and not at x = h.

Numerical analysis is sometimes termed an art instead of a science because, in 
situations like this, the number of Taylor-series terms to be included is a matter of judgment 
and experience. We normally truncate the Taylor series when the contribution of the last 
term is negligible to the number of decimal places to which we are working. However, 
this is correct only when the succeeding terms become small rapidly enough—in some 
cases the sum of the many neglected small terms is significant. 

The Taylor series is easily applied to a higher-order equation. For example, if we 
are given 


y'=3+x-y%, yO) = 1, y' (0) = 


we can find the derivative terms in the Taylor series as follows: 
y(0), y'(0) are given by the initial conditions.
y''(0) results by substitution into the differential equation from y(0) and y'(0).
y'''(0), $y^{iv}(0)$, . . . are found by differentiating the equation for the previous order
of derivative and substituting previously computed values.


5.3 EULER AND MODIFIED EULER METHODS


As we have seen, the Taylor-series method is awkward to apply if the various derivatives 
are complicated, and the error is difficult to determine. An even more significant criticism 
in this computer age is that taking derivatives of arbitrary functions cannot easily be 
written into a computer program. We look for an approach that is not subject to these 
disadvantages. 


One thing we do know about the Taylor series: the error will be small if the step
size h (the interval beyond $x_0$ where we evaluate the series) is small. In fact, if it is small
enough, only a few terms are needed for good accuracy. The Euler method may be thought
of as following this idea to the extreme for first-order differential equations. Suppose we
choose h small enough that we may truncate after the first-derivative term. Then

$$y(x_0 + h) = y(x_0) + h\,y'(x_0) + \frac{y''(\xi)\,h^2}{2}, \qquad x_0 < \xi < x_0 + h.$$

We have written the usual form of the error term for the truncated Taylor series.
In using this equation, the value of $y(x_0)$ is given by the initial condition and $y'(x_0)$
is evaluated from $f(x_0, y_0)$, given by the differential equation, dy/dx = f(x, y). It will
of course be necessary to use this method iteratively, advancing the solution to
$x = x_0 + 2h$ after $y(x_0 + h)$ has been found, then to $x = x_0 + 3h$, and so on. Adopting
a subscript notation for the successive y-values and representing the error by the
order relation, we may write the algorithm for the Euler method as

$$y_{n+1} = y_n + h\,y'_n + O(h^2)\ \text{error}.^* \tag{5.2}$$

As an example, we apply this to the simple equation

$$\frac{dy}{dx} = x + y, \qquad y(0) = 1,$$

where the computations can be done mentally. It is convenient to arrange the work as in
Table 5.3. Take h = 0.02.


Table 5.3

x_n     y_n       y'_n      h y'_n
0       1.0000    1.0000    0.0200
0.02    1.0200    1.0400    0.0208
0.04    1.0408    1.0808    0.0216
0.06    1.0624    1.1224    0.0224
0.08    1.0848    1.1648    0.0233
0.10    1.1081

(1.1103 analytical, error is 0.0022)


*This is the order of error for one step only, the “local error.” As detailed in a later section, over many steps
the error becomes O(h). Such accumulated error is termed the global error.
Figure 5.1


Each of the $y_n$ values is computed using Eq. (5.2), adding $h y'_n$ and $y_n$ of the previous
line. Comparing the last result to the analytical answer y(0.10) = 1.1103, we see that
there is only two-decimal-place accuracy, even though we have advanced the solution
only five steps. To gain four-decimal accuracy, we must reduce the error at least 22-fold.
Since the global error is about proportional to h, we will need to reduce the step about
22-fold, to less than 0.001.
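
The Euler computation, and the claim that the accumulated error shrinks roughly in proportion to h, can be checked with a short program. This is a minimal sketch in Python, not the book's own program; the function name `euler` and the second step size of 0.004 are assumptions chosen for illustration. Because the code carries full precision instead of rounding each line to four decimals, its last digits differ slightly from a hand-rounded table.

```python
import math

def euler(f, x0, y0, h, steps):
    """Advance y' = f(x, y) from (x0, y0) by `steps` Euler steps of size h."""
    x, y = x0, y0
    for _ in range(steps):
        y += h * f(x, y)  # y_{n+1} = y_n + h * y'_n
        x += h
    return y

def f(x, y):
    return x + y

exact = 2.0 * math.exp(0.1) - 0.1 - 1.0  # analytical y(0.1)

y_coarse = euler(f, 0.0, 1.0, 0.02, 5)    # h = 0.02, five steps to x = 0.1
y_fine = euler(f, 0.0, 1.0, 0.004, 25)    # h = 0.004, twenty-five steps

error_coarse = exact - y_coarse
error_fine = exact - y_fine
print(y_coarse, error_coarse)
print(y_fine, error_fine, error_coarse / error_fine)
```

Cutting h by a factor of 5 cuts the error at x = 0.1 by close to the same factor, which is the behavior expected of a global error of O(h).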

The trouble with this most simple method is its lack of accuracy, requiring an 
extremely small step size. Figure 5.1 suggests how we might improve this method with 
little additional effort. 

In the simple Euler method, we use the slope at the beginning of the interval, $y'_n$, to
determine the increment to the function, but this is always wrong. After all, if the slope
of the function were constant, the solution would be an obvious linear relation. We need to
use an average slope over the interval if we hope to estimate the change in y with precision.

Suppose we do this, using the arithmetic average of the slopes at the beginning and
end of the interval:

$$y_{n+1} = y_n + h\,\frac{y'_n + y'_{n+1}}{2}. \tag{5.3}$$


This will surely give an improved estimate for y at $x_{n+1}$. However, we are unable
to employ Eq. (5.3) immediately, because, since the derivative is a function of both x
and y, we cannot evaluate $y'_{n+1}$ with $y_{n+1}$ unknown. The modified Euler method surmounts
the difficulty by estimating, or “predicting,” a value of $y_{n+1}$ by the simple Euler relation,
Eq. (5.2), and uses this value to compute $y'_{n+1}$, giving an improved estimate (“corrected”
value) for $y_{n+1}$. Since the value of $y'_{n+1}$ was computed using the predicted value, of less
than perfect accuracy, one is tempted to recorrect the $y_{n+1}$ value as many times as will
make a significant difference. (If more than two or three recorrections are required, it is
more efficient to reduce the step size.)

[Figure 5.1 labels: analytical solution curve, with $y_0$ at $x_0$ and both $y_1$ and the true y-value at $x_0 + h$.]

We will illustrate the modified Euler method, which we also call the Euler
predictor-corrector method, on the same problem previously treated. Table 5.4 is convenient.

$$\frac{dy}{dx} = x + y, \qquad y(0) = 1, \qquad h = 0.02.$$
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Xn Yn hyn Ynst Yntt Yay NY 

0 1.0000 1.0000 0.0200 1.0200 1.0400 1.0200 0.0204 
1.0204* 1.0404 1.0202 0.0204 
0,02 1.0204 1,0404 0.0208 1.04129 1.0812 1.0608 0.0212 
tid V } £ 1.0416 1.0816 1.0610 0.0212 
0.04 1.0416 1.0816 0.0216 1.0632 1.1232 1.1024 0.0220 
1.0636 1.1236 1.1026 0.0221 
1.0637 1.1237 1.1027 0.0221 
0.06 1.0637 1.1237 0.0225 1.0862 1.1662 1.1449 0.0229 
1.0866 1.1666 1.1451 0.0229 
0.08 1.0866 1.1666 0.0233 1.1099 1.2099 1.1883 0.0238 
1.1104 1.2104 1.1885 0.0238 

0.10 1.1104 


(1.1103 analytical value) 


*It is convenient to use this column for both the predicted and corrected values of y,.. The first entry in any x-row is the predicted value. Corrected 
and recorrected values follow. In this example, it was necessary to recorrect only where x = 0.04. All the other calculations indicate no need to recorrect 
the values of y,,,,. We will discuss in a later section whether one should recorrect the first corrected value of y,,., in this type of method. In general, 
it is better to make A small and not recorrect 


In this table, we tabulate the corrected values of $y_{n+1}$ in the same column as the
predicted ones; $y'_{av}$ is the mean of $y'_n$ and the last value of $y'_{n+1}$. The answer agrees within
1 in the fourth decimal place. We have done more work than in the simple Euler method,
but certainly not the 22 times more that would have been needed with that method to
attain four-decimal-place accuracy. Hence the modified method is more efficient.
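
The predictor-corrector loop tabulated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `modified_euler`, `exact`, and the `recorrections` parameter are invented here, and the closed form y = 2e^x - x - 1 used for comparison is obtained by solving dy/dx = x + y, y(0) = 1 directly. Because the sketch carries full floating-point precision rather than four decimals, its last digit can differ slightly from Table 5.4.

```python
import math


def f(x, y):
    """Right-hand side of the example problem dy/dx = x + y."""
    return x + y


def exact(x):
    """Closed-form solution of dy/dx = x + y with y(0) = 1."""
    return 2.0 * math.exp(x) - x - 1.0


def modified_euler(f, x0, y0, h, n_steps, recorrections=0):
    """Advance n_steps with the Euler predictor and the averaged-slope corrector.

    recorrections is a hypothetical knob: 0 means predict once and correct once.
    """
    x, y = x0, y0
    for i in range(n_steps):
        slope_start = f(x, y)
        y_next = y + h * slope_start                      # predictor: simple Euler
        for _ in range(1 + recorrections):
            slope_end = f(x + h, y_next)
            y_next = y + h * 0.5 * (slope_start + slope_end)  # corrector, Eq. (5.3)
        x, y = x0 + (i + 1) * h, y_next
    return y


if __name__ == "__main__":
    approx = modified_euler(f, 0.0, 1.0, 0.02, 5)
    print(f"modified Euler y(0.1) = {approx:.5f}, analytical = {exact(0.1):.5f}")
```

Setting `recorrections=1` repeats the corrector once more on every step, which is a cruder policy than the table's practice of recorrecting only where the value still changes.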

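We can find the error of the modified Euler method by comparing with the Taylor
series:

$$y_{n+1} = y_n + y'_n h + \tfrac{1}{2} y''_n h^2 + \frac{y'''(\xi)}{6} h^3, \qquad x_n < \xi < x_n + h.$$

Replace the second derivative by the forward-difference approximation for $y''$,
$(y'_{n+1} - y'_n)/h$, which has error of O(h), and write the error term as $O(h^3)$:

$$y_{n+1} = y_n + h\left(y'_n + \tfrac{1}{2}\left[\frac{y'_{n+1} - y'_n}{h} + O(h)\right]h\right) + O(h^3),$$

$$y_{n+1} = y_n + h\left(y'_n + \tfrac{1}{2}y'_{n+1} - \tfrac{1}{2}y'_n\right) + O(h^3),$$

$$y_{n+1} = y_n + h\left(\frac{y'_n + y'_{n+1}}{2}\right) + O(h^3).$$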


This shows that the error of one step of the modified Euler method is $O(h^3)$. This
is the "local error." There is an accumulation of errors from step to step, so that the error
over the whole range of application, the so-called global error, is $O(h^2)$. This seems
intuitively reasonable since the number of steps into which the interval is subdivided is
proportional to 1/h; hence the order of error is reduced to $O(h^2)$ on the continuing
application. We treat the accumulation of errors more fully in a later section.


5.4 RUNGE-KUTTA METHODS


A further advance in efficiency (that is, obtaining the most accuracy per unit of com- 
putational effort) can be secured with a group of methods due to the German mathe- 
maticians Runge and Kutta. The fourth-order Runge-Kutta methods are widely used in 
computer solutions to differential equations. The development of this technique is alge- 
braically complicated. 

To convey some idea of how the Runge-Kutta methods are developed, we show the
derivation of a second-order method. We write the increment to y as a weighted average
of two estimates of $\Delta y$, $k_1$ and $k_2$. For the equation dy/dx = f(x, y),

$$y_{n+1} = y_n + a k_1 + b k_2,$$

$$k_1 = h f(x_n, y_n), \qquad (5.4)$$

$$k_2 = h f(x_n + \alpha h,\; y_n + \beta k_1).$$


We can think of the values $k_1$ and $k_2$ as estimates of the change in y when x advances
by h because they are the product of the change in x and a value for the slope of the
curve, dy/dx. The Runge-Kutta methods always use as the first estimate of $\Delta y$ the simple
Euler estimate; the other estimate is taken with x and y stepped up by the fractions $\alpha$
and $\beta$ of h and of the earlier estimate of $\Delta y$, $k_1$. Our problem is to devise a scheme of
choosing the four parameters, a, b, $\alpha$, $\beta$. We do this by making Eq. (5.4) agree as well
as possible with the Taylor-series expansion, in which the y-derivatives are written in
terms of f, from dy/dx = f(x, y),

$$y_{n+1} = y_n + h f(x_n, y_n) + \frac{h^2}{2} f'(x_n, y_n) + \cdots.$$

An equivalent form, since $df/dx = f_x + f_y\,dy/dx = f_x + f_y f$, is

$$y_{n+1} = y_n + h f_n + h^2\left(\tfrac{1}{2} f_x + \tfrac{1}{2} f_y f\right)_n. \qquad (5.5)$$


(All the derivatives in Eq. (5.5) are evaluated at the point $(x_n, y_n)$.) We now rewrite Eq.
(5.4) by substituting the definitions of $k_1$ and $k_2$:

$$y_{n+1} = y_n + a h f(x_n, y_n) + b h f[x_n + \alpha h,\; y_n + \beta h f(x_n, y_n)]. \qquad (5.6)$$

To make the last term of Eq. (5.6) comparable to Eq. (5.5), we expand f(x, y) in a
Taylor series about $(x_n, y_n)$, remembering that f is a function of two variables,*
retaining only first-derivative terms:

$$f[x_n + \alpha h,\; y_n + \beta h f(x_n, y_n)] \approx \left(f + f_x \alpha h + f_y \beta h f\right)_n. \qquad (5.7)$$

On the right side of both Eqs. (5.5) and (5.7), f and its partial derivatives are all to be
evaluated at $(x_n, y_n)$.
Substituting from Eq. (5.7) into Eq. (5.6), we have

$$y_{n+1} = y_n + a h f_n + b h \left(f + f_x \alpha h + f_y \beta h f\right)_n,$$

or, rearranging,

$$y_{n+1} = y_n + (a + b) h f_n + h^2\left(\alpha b f_x + \beta b f_y f\right)_n. \qquad (5.8)$$

Equation (5.8) will be identical to Eq. (5.5) if

$$a + b = 1, \qquad \alpha b = \tfrac{1}{2}, \qquad \beta b = \tfrac{1}{2}.$$


Note that there are only three equations that need to be satisfied by the four unknowns.
We can choose one value arbitrarily (with minor restrictions); hence we have a set of
second-order methods. For example, taking $a = \tfrac{2}{3}$, we have $b = \tfrac{1}{3}$, $\alpha = \tfrac{3}{2}$, $\beta = \tfrac{3}{2}$. Other
choices give other sets of parameters that agree with the Taylor-series expansion. If one
takes $a = \tfrac{1}{2}$, the other variables are $b = \tfrac{1}{2}$, $\alpha = 1$, $\beta = 1$. This last set of parameters
gives the modified Euler algorithm that we have previously discussed; the modified Euler
method is a special case of a second-order Runge-Kutta method.
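
As a rough numerical check of the three conditions, the sketch below implements one generic step of Eq. (5.4) and halves the step size to see whether the global error drops by about a factor of four, as expected for a second-order method. The function names and the test interval [0, 1] are assumptions made for this illustration; the test equation is the chapter's dy/dx = x + y, y(0) = 1, with y = 2e^x - x - 1 as its closed-form solution.

```python
import math


def rk2_step(f, x, y, h, a, b, alpha, beta):
    """One step of the two-stage scheme of Eq. (5.4)."""
    k1 = h * f(x, y)
    k2 = h * f(x + alpha * h, y + beta * k1)
    return y + a * k1 + b * k2


def solve(f, x0, y0, h, n_steps, params):
    """Apply rk2_step n_steps times and return the final y."""
    y = y0
    for i in range(n_steps):
        y = rk2_step(f, x0 + i * h, y, h, *params)
    return y


def f(x, y):
    return x + y


def exact(x):
    return 2.0 * math.exp(x) - x - 1.0


# Two parameter sets (a, b, alpha, beta) satisfying a + b = 1, alpha*b = beta*b = 1/2.
MODIFIED_EULER = (0.5, 0.5, 1.0, 1.0)
ALTERNATIVE = (2.0 / 3.0, 1.0 / 3.0, 1.5, 1.5)


def global_error(h, params):
    """Absolute error at x = 1 when stepping from x = 0 with step h."""
    n_steps = round(1.0 / h)
    return abs(solve(f, 0.0, 1.0, h, n_steps, params) - exact(1.0))


if __name__ == "__main__":
    for name, params in (("modified Euler", MODIFIED_EULER), ("a = 2/3", ALTERNATIVE)):
        ratio = global_error(0.1, params) / global_error(0.05, params)
        print(f"{name}: error ratio when h is halved = {ratio:.2f}")
```

For this particular equation f is linear in x and y, so every admissible parameter set produces the same step; a nonlinear f would be needed to see the sets differ in their third-order error terms.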

Fourth-order Runge-Kutta methods are most widely used and are derived in similar
fashion. Greater complexity results from having to compare terms through $h^4$, and gives
a set of 11 equations in 13 unknowns. The set of 11 equations can be solved with 2
unknowns being chosen arbitrarily. The most commonly used set of values leads to the
algorithm

$$y_{n+1} = y_n + \tfrac{1}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right),$$

$$k_1 = h f(x_n, y_n),$$

$$k_2 = h f\left(x_n + \tfrac{1}{2}h,\; y_n + \tfrac{1}{2}k_1\right),$$

$$k_3 = h f\left(x_n + \tfrac{1}{2}h,\; y_n + \tfrac{1}{2}k_2\right),$$

$$k_4 = h f(x_n + h,\; y_n + k_3).$$


*Appendix A will remind readers of this expansion. 
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As an example, we again solve dy/dx = x + y, y(0) = 1, this time taking h = 0.1: 


$$k_1 = 0.1(0 + 1) = 0.10000,$$

$$k_2 = 0.1(0.05 + 1.05) = 0.11000,$$

$$k_3 = 0.1(0.05 + 1.055) = 0.11050,$$

$$k_4 = 0.1(0.10 + 1.1105) = 0.12105,$$

$$y(0.1) = 1.0000 + \tfrac{1}{6}(0.10000 + 0.22000 + 0.22100 + 0.12105) = 1.11034.$$


This agrees to five decimals with the analytical result, and illustrates a further gain 
in accuracy with less effort than required by our example of Section 5.2. 
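
The hand computation above maps directly onto a short routine. The following Python sketch (the name `rk4_step` is an illustrative choice) performs one step of the classical fourth-order formulas for dy/dx = x + y from x = 0, y = 1 with h = 0.1.

```python
def rk4_step(f, x, y, h):
    """One step of the classical fourth-order Runge-Kutta algorithm."""
    k1 = h * f(x, y)
    k2 = h * f(x + h / 2, y + k1 / 2)
    k3 = h * f(x + h / 2, y + k2 / 2)
    k4 = h * f(x + h, y + k3)
    return y + (k1 + 2 * k2 + 2 * k3 + k4) / 6


def f(x, y):
    """Right-hand side of the example problem dy/dx = x + y."""
    return x + y


if __name__ == "__main__":
    print(f"y(0.1) = {rk4_step(f, 0.0, 1.0, 0.1):.5f}")
```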

The local error term for the fourth-order Runge-Kutta is $O(h^5)$; the global error
would be $O(h^4)$. It is computationally more efficient than the modified Euler method
because, while four evaluations of the function are required per step rather than two, the
steps can be manyfold larger for the same accuracy.

It is easy to see why the Runge-Kutta technique is so popular. Since going from
second to fourth order was so beneficial, we may wonder if we should use a still higher
order of formula. Higher-order (fifth, sixth, and so on) Runge-Kutta formulas have been
developed and can be used to advantage in determining a suitable size, h, as we will see.

A standard way to determine whether the Runge-Kutta values are sufficiently accurate
is to recompute the value at the end of each interval with the step size cut in half. If this
makes a change of negligible magnitude, the results are accepted; if not, the step is halved
again until the results are satisfactory. This is very expensive, however, because seven
additional sets of function evaluations are made just to determine the accuracy. (We may
not have to make a new computation after the last one to ascertain its accuracy, of course.
The knowledge that the method is of $O(h^5)$ accuracy permits us to anticipate the change
that an additional computation would make if we have repeated once already.) There
have been several schemes proposed to minimize the effort to determine the error in a
Runge-Kutta computation. They demand some additional effort, but fewer than seven
additional function evaluations, which, plus the original four, total eleven.

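The Runge-Kutta-Fehlberg method is one of the most popular of these methods at
the present time. Six functional evaluations are required, but we then have an estimate
of the error as well:

$$k_1 = h f(x_n, y_n),$$

$$k_2 = h f\left(x_n + \frac{h}{4},\; y_n + \frac{k_1}{4}\right),$$

$$k_3 = h f\left(x_n + \frac{3h}{8},\; y_n + \frac{3k_1}{32} + \frac{9k_2}{32}\right),$$

$$k_4 = h f\left(x_n + \frac{12h}{13},\; y_n + \frac{1932k_1}{2197} - \frac{7200k_2}{2197} + \frac{7296k_3}{2197}\right),$$

$$k_5 = h f\left(x_n + h,\; y_n + \frac{439k_1}{216} - 8k_2 + \frac{3680k_3}{513} - \frac{845k_4}{4104}\right),$$

$$k_6 = h f\left(x_n + \frac{h}{2},\; y_n - \frac{8k_1}{27} + 2k_2 - \frac{3544k_3}{2565} + \frac{1859k_4}{4104} - \frac{11k_5}{40}\right);$$

$$\hat{y}_{n+1} = y_n + \left(\frac{25k_1}{216} + \frac{1408k_3}{2565} + \frac{2197k_4}{4104} - \frac{k_5}{5}\right), \quad \text{with global error } O(h^4),$$

$$y_{n+1} = y_n + \left(\frac{16k_1}{135} + \frac{6656k_3}{12{,}825} + \frac{28{,}561k_4}{56{,}430} - \frac{9k_5}{50} + \frac{2k_6}{55}\right), \quad \text{with global error } O(h^5);$$

$$E = \frac{k_1}{360} - \frac{128k_3}{4275} - \frac{2197k_4}{75{,}240} + \frac{k_5}{50} + \frac{2k_6}{55}.$$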

The basis for the Runge-Kutta-Fehlberg scheme is to compute two Runge-Kutta
estimates for the new value of $y_{n+1}$ but of different orders of errors. Thus, instead of comparing
estimates of $y_{n+1}$ for h and h/2, we compare the estimates $\hat{y}_{n+1}$ and $y_{n+1}$ using fourth-
and fifth-order (global) Runge-Kutta formulas. Moreover, both equations make use of
the same k's, so only six function evaluations are needed versus the previous eleven. In
addition, one can increase or decrease h depending on the value of E at each step. As
our estimate for the new $y_{n+1}$, we use the fifth-order (global) estimate.

As an example, we once more solve dy/dx = x + y, y(0) = 1, again taking h = 0.1:

$$k_1 = 0.1(0 + 1) = 0.1,$$

$$k_2 = 0.1(0.025 + 1.025) = 0.105,$$

$$k_3 = 0.1(0.0375 + 1.0389) = 0.10764,$$

$$k_4 = 0.1(0.09231 + 1.101295) = 0.119360,$$

$$k_5 = 0.1(0.1 + 1.110824) = 0.121082,$$

$$k_6 = 0.1(0.05 + 1.052415) = 0.110242;$$

$$\hat{y}_{n+1} = 1 + (0.01157 + 0.05909 + 0.06390 - 0.02422) = 1.11034197,$$

$$y_{n+1} = 1 + (0.01185 + 0.05586 + 0.06041 - 0.02180 + 0.00401) = 1.110341834.$$

E = 0.00000014 and the exact value is 1.110341836.


This agrees to eight decimal places (the actual computation was done in ten-digit
arithmetic) with the analytical result, with only two more function evaluations. Moreover,
we have the value E to adjust our step size for the next iteration.
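
The six-stage formulas above translate directly into code. The following is a sketch under stated assumptions rather than a production integrator: the name `rkf45_step` and its return convention are invented for the illustration, there is no step-size control, and the size of E obtained in double precision need not match the figure quoted from the ten-digit hand computation.

```python
def rkf45_step(f, x, y, h):
    """One Runge-Kutta-Fehlberg step.

    Returns (y5, y4, E): the fifth-order estimate, the fourth-order estimate,
    and the error estimate E computed from the weighted sum of the k's.
    """
    k1 = h * f(x, y)
    k2 = h * f(x + h / 4, y + k1 / 4)
    k3 = h * f(x + 3 * h / 8, y + 3 * k1 / 32 + 9 * k2 / 32)
    k4 = h * f(x + 12 * h / 13,
               y + 1932 * k1 / 2197 - 7200 * k2 / 2197 + 7296 * k3 / 2197)
    k5 = h * f(x + h,
               y + 439 * k1 / 216 - 8 * k2 + 3680 * k3 / 513 - 845 * k4 / 4104)
    k6 = h * f(x + h / 2,
               y - 8 * k1 / 27 + 2 * k2 - 3544 * k3 / 2565
               + 1859 * k4 / 4104 - 11 * k5 / 40)
    y4 = y + 25 * k1 / 216 + 1408 * k3 / 2565 + 2197 * k4 / 4104 - k5 / 5
    y5 = (y + 16 * k1 / 135 + 6656 * k3 / 12825 + 28561 * k4 / 56430
          - 9 * k5 / 50 + 2 * k6 / 55)
    err = (k1 / 360 - 128 * k3 / 4275 - 2197 * k4 / 75240
           + k5 / 50 + 2 * k6 / 55)
    return y5, y4, err


def f(x, y):
    """Right-hand side of the example problem dy/dx = x + y."""
    return x + y


if __name__ == "__main__":
    y5, y4, err = rkf45_step(f, 0.0, 1.0, 0.1)
    print(f"fifth-order estimate  = {y5:.9f}")
    print(f"fourth-order estimate = {y4:.9f}")
    print(f"E = {err:.2e}")
```

E is simply the difference between the two weighted sums, so it costs no extra function evaluations; a step-size controller would compare its magnitude against a tolerance before accepting the step.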

Let us summarize and compare (see Table 5.5) the four numerical methods we have
studied for solving y' = f(x, y).

Table 5.5

Method                       Estimate of slope over x-interval     Global    Local     Evaluations of
                                                                   error     error     f(x, y) per step
Euler                        Initial value                         O(h)      O(h^2)    1
Modified Euler               Arithmetic average of initial and     O(h^2)    O(h^3)    2
                             final predicted slope
Runge-Kutta (fourth-order)   Weighted average of four values       O(h^4)    O(h^5)    4
Runge-Kutta-Fehlberg         Weighted average of six values        O(h^5)    O(h^6)    6

In the example problem, y' = x + y, y(0) = 1, we obtained the results shown in
Table 5.6 for the y-value at x = 0.1.

Table 5.6

Method                       Step size   Result        Error          Number of function evaluations
Euler                        0.02        1.1081        0.0022         5
Modified Euler               0.02        1.1104        0.0001         12
Runge-Kutta (fourth-order)   0.1         1.11034       0.000001       4
Runge-Kutta-Fehlberg         0.1         1.11034183    0.000000002    6

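For completeness we mention the Runge-Kutta-Merson method. This method computes
five estimates of the $\Delta y$ in the next step; an estimate of the local error is then available from a
weighted sum of the individual estimates:

$$k_1 = h f(x_n, y_n),$$

$$k_2 = h f\left(x_n + \frac{h}{3},\; y_n + \frac{k_1}{3}\right),$$

$$k_3 = h f\left(x_n + \frac{h}{3},\; y_n + \frac{k_1}{6} + \frac{k_2}{6}\right),$$

$$k_4 = h f\left(x_n + \frac{h}{2},\; y_n + \frac{k_1}{8} + \frac{3k_3}{8}\right),$$

$$k_5 = h f\left(x_n + h,\; y_n + \frac{k_1}{2} - \frac{3k_3}{2} + 2k_4\right);$$

$$y_{n+1} = y_n + \frac{k_1 + 4k_4 + k_5}{6} + O(h^5);$$

$$E = \tfrac{1}{30}\left(2k_1 - 9k_3 + 8k_4 - k_5\right).$$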


Finally, there is an IMSL subroutine DVERK that uses Runge-Kutta formulas of
orders 5 and 6 that were developed by J. H. Verner. This method uses eight function
evaluations to find the two estimates of $y_{n+1}$.


5.5 MULTISTEP METHODS


Runge—Kutta-type methods (which include Euler and modified Euler as special cases) 
are called single-step methods because they use only the information from the last step 
computed. In this they have the ability to perform the next step with a different step size 
and are ideal for beginning the solution where only the initial conditions are available. 
After the solution has begun, however, there is additional information available about 
the function (and its derivative) if we are wise enough to retain it in the memory of the 
computer. A multistep method is one that takes advantage of this fact. 

The principle behind a multistep method is to utilize the past values of y and/or y'
to construct a polynomial that approximates the derivative function, and extrapolate this
into the next interval. Most methods use equispaced past values to make the construction
of the polynomial easy. The Adams method is typical. The number of past points that
are used sets the degree of the polynomial and is therefore responsible for the truncation
error. The order of the method is equal to the power of h in the global error term of the
formula, which is also equal to one more than the degree of the polynomial.

To derive the relations for the Adams method, we write the differential equation
dy/dx = f(x, y) in the form

$$dy = f(x, y)\,dx,$$

and we integrate between $x_n$ and $x_{n+1}$:

$$\int_{x_n}^{x_{n+1}} dy = y_{n+1} - y_n = \int_{x_n}^{x_{n+1}} f(x, y)\,dx.$$

In order to integrate the term on the right, we approximate f(x, y) as a polynomial in x,
deriving this by making it fit at several past points. If we use three past points, the
approximating polynomial will be a quadratic. If we use four points, it will be a cubic.
The more points we use, the better the accuracy (until round-off interferes, of course).

Suppose we fit a second-degree polynomial through three past points, writing this
as a Newton-Gregory backward polynomial:

$$\int_{x_n}^{x_{n+1}} dy = y_{n+1} - y_n = \int_{x_n}^{x_{n+1}} \left(f_n + s\,\Delta f_{n-1} + \frac{(s+1)s}{2}\,\Delta^2 f_{n-2} + \text{error}\right) dx$$

$$= h\int_{s=0}^{s=1} \left(f_n + s\,\Delta f_{n-1} + \frac{(s+1)s}{2}\,\Delta^2 f_{n-2}\right) ds + h\int_{s=0}^{s=1} \frac{s(s+1)(s+2)}{6}\,h^3 f'''(\xi)\,ds.$$


In the preceding, we have changed the variable to s and identified $x_n$ as $x_0$. The interval
of integration becomes s = 0 to s = 1. Performing the integration, we get

$$y_{n+1} = y_n + h\left(f_n + \tfrac{1}{2}\Delta f_{n-1} + \tfrac{5}{12}\Delta^2 f_{n-2}\right) + O(h^4). \qquad (5.9)$$

While it is perfectly possible to use Eq. (5.9) as it stands, constructing a difference table
is not necessary. If we expand the differences of f in terms of the values of $f_n$, $f_{n-1}$, and
$f_{n-2}$, we get a more useful formula:

$$y_{n+1} = y_n + h\left[f_n + \frac{f_n - f_{n-1}}{2} + \frac{5(f_n - 2f_{n-1} + f_{n-2})}{12}\right]$$

$$= y_n + \frac{h}{12}\left(23f_n - 16f_{n-1} + 5f_{n-2}\right) + O(h^4). \qquad (5.10)$$

Observe that Eq. (5.10) resembles the single-step formulas of the previous sections in
that the increment to y is a weighted sum of the derivatives times the step size, but differs
in that past values are used rather than estimates in the forward direction.


EXAMPLE

We illustrate the use of Eq. (5.10) to calculate y(0.6) for dy/dx = x + y, y(0) = 1. We
compute good values for y(0.2) and y(0.4) using a single-step method. In this case we
obtain these values using the Runge-Kutta-Fehlberg method with h = 0.2. These values
are given in Table 5.7.


Table 5.7

x      y          f(x, y)

0.0    1.00000    1.00000
0.2    1.24281    1.44281
0.4    1.58365    1.98365

Then from Eq. (5.10):

$$y(0.6) = 1.58365 + \frac{0.2}{12}\left[23(1.98365) - 16(1.44281) + 5(1.0)\right] = 2.04260.$$

Comparing our result with the exact solution (2.04424), we find that the computed
value has an error of 0.00164. We can reduce the size of the error by doing the calculations
with a smaller step size of 0.1. We use the Runge-Kutta-Fehlberg method once again
to obtain the values in Table 5.8. Using the values at x = 0.3, x = 0.4, x = 0.5 from
Table 5.8, we repeat the computations:

$$y(0.6) = 1.79744 + \frac{0.1}{12}\left[23(2.29744) - 16(1.98364) + 5(1.69972)\right] = 2.04412,$$

which has an error of 0.00012 compared with the exact answer of 2.04424.
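
A short Python sketch of the h = 0.1 calculation follows. It treats an Adams step as a weighted sum of stored derivative values, so the same routine also accepts the four-point weights 55/24, -59/24, 37/24, -9/24 that are derived later in this section. Two assumptions are worth flagging: the names (`adams_step`, `demo`, `AB3`, `AB4`) are invented for the illustration, and the starting values are taken from the closed-form solution y = 2e^x - x - 1 as a stand-in for the Runge-Kutta-Fehlberg values of Table 5.8.

```python
import math

# Weights multiply f_n, f_{n-1}, ... in that order.
AB3 = (23 / 12, -16 / 12, 5 / 12)              # Eq. (5.10)
AB4 = (55 / 24, -59 / 24, 37 / 24, -9 / 24)    # four-point formula


def f(x, y):
    return x + y


def exact(x):
    """Closed-form solution of dy/dx = x + y, y(0) = 1 (used for starting values)."""
    return 2.0 * math.exp(x) - x - 1.0


def adams_step(y_n, f_past, h, weights):
    """Advance one step; f_past lists derivative values newest first."""
    return y_n + h * sum(w * fv for w, fv in zip(weights, f_past))


def demo(h, weights, x_target=0.6):
    """Estimate y(x_target) from exact values at the preceding equispaced points."""
    xs = [x_target - (j + 1) * h for j in range(len(weights))]   # x_n, x_{n-1}, ...
    f_past = [f(x, exact(x)) for x in xs]
    return adams_step(exact(xs[0]), f_past, h, weights)


if __name__ == "__main__":
    print(f"three-point estimate of y(0.6): {demo(0.1, AB3):.5f}")
    print(f"four-point estimate of y(0.6):  {demo(0.1, AB4):.5f}")
    print(f"exact y(0.6):                   {exact(0.6):.5f}")
```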




Table 5.8

x      y          f(x, y)

0.0    1.00000    1.00000
0.1    1.11034    1.21034
0.2    1.24281    1.44281
0.3    1.39972    1.69972
0.4    1.58364    1.98364
0.5    1.79744    2.29744


In Section 5.7 we will see that we can reduce the error by using more past points
for fitting the polynomial. In fact, when the derivation is done for four points using the
Newton-Gregory backward polynomial, the following results are obtained:

$$y_{n+1} = y_n + h\left(f_n + \tfrac{1}{2}\Delta f_{n-1} + \tfrac{5}{12}\Delta^2 f_{n-2} + \tfrac{3}{8}\Delta^3 f_{n-3}\right) + O(h^5),$$

$$y_{n+1} = y_n + \frac{h}{24}\left[55f_n - 59f_{n-1} + 37f_{n-2} - 9f_{n-3}\right] + O(h^5).$$


We repeat the above example with h = 0.1 to compute y(0.6), now using the values
in Table 5.8 for x = 0.2, x = 0.3, x = 0.4, x = 0.5, and computing

$$y(0.6) = 1.79744 + \frac{0.1}{24}\left[55(2.29744) - 59(1.98364) + 37(1.69972) - 9(1.44281)\right] = 2.04423.$$


The error of this computation has been reduced to 0.00001. We summarize the results 
of these two formulas in Table 5.9. 


Table 5.9

Number of      Estimate of    Error
points used    y(0.6)         (h = 0.1)

3              2.04412        0.00012
4              2.04423        0.00001


5.6 MILNE'S METHOD


The method of Milne is a multistep method that first predicts a value for $y_{n+1}$ by
extrapolating values for the derivative. It differs from the Adams method in that it integrates
over more than one interval. The past values that we require may have been computed
by the Runge-Kutta method, or possibly by the Taylor-series method. In the Milne
method, we suppose that four equispaced starting values of y are known, at the points
$x_n$, $x_{n-1}$, $x_{n-2}$, and $x_{n-3}$. We can employ quadrature formulas to integrate as follows:

$$\frac{dy}{dx} = f(x, y);$$

$$\int_{x_{n-3}}^{x_{n+1}} \left(\frac{dy}{dx}\right) dx = \int_{x_{n-3}}^{x_{n+1}} f(x, y)\,dx \approx \int_{x_{n-3}}^{x_{n+1}} P_2(x)\,dx; \qquad (5.11)$$

so

$$y_{n+1} - y_{n-3} = \frac{4h}{3}\left(2f_n - f_{n-1} + 2f_{n-2}\right) + \frac{28}{90}h^5 y^{(5)}(\xi_1), \qquad x_{n-3} < \xi_1 < x_{n+1}.$$


We integrate the function f(x, y) by replacing it with a quadratic interpolating
polynomial that fits at the three points, where $x = x_n$, $x_{n-1}$, and $x_{n-2}$, and integrating according
to the methods of Chapter 4.* Note that we extrapolate in the integration by one panel
both to the left and to the right of the region of fit. Hence the error is larger, because of
the extrapolation, than it would be with only interpolation.

With the value of $y_{n+1}$, one can calculate $f_{n+1}$ reasonably accurately. In Milne's
method, we use Eq. (5.11) as a predictor formula and then correct with

$$\int_{x_{n-1}}^{x_{n+1}} \left(\frac{dy}{dx}\right) dx = \int_{x_{n-1}}^{x_{n+1}} f(x, y)\,dx \approx \int_{x_{n-1}}^{x_{n+1}} P_2(x)\,dx, \qquad (5.12)$$

$$y_{n+1,c} - y_{n-1} = \frac{h}{3}\left(f_{n+1} + 4f_n + f_{n-1}\right) - \frac{h^5}{90}y^{(5)}(\xi_2), \qquad x_{n-1} < \xi_2 < x_{n+1}.$$


In Eq. (5.12), the polynomial $P_2$ is not identical to that in Eq. (5.11) because they
do not fit the function at the same three points. In Eq. (5.12) the polynomial fits at $x_{n+1}$,
$x_n$, and $x_{n-1}$. Note the changed range of integration and the smaller coefficient in the
error term of Eq. (5.12) because the polynomial is not extrapolated; $f_{n+1}$ is calculated
using $y_{n+1}$ from the predictor formula. The integration formula is the familiar Simpson's
1/3 rule, since we integrate a quadratic over two panels within the region of fit. In this
method we do not try to recorrect $y_{n+1}$.


*The formula can also be derived by the method of undetermined coefficients, as shown in the Appendix.


We illustrate with our familiar simple problem, dy/dx = x + y, y(0) = 1. From
Section 5.1, we take the four values calculated by Taylor series and carry five decimals
(see Table 5.10).


Table 5.10

           x      y            f(x, y)

           0.0    1.00000      1.00000
Taylor     0.1    1.11034      1.21034
series     0.2    1.24280      1.44280
           0.3    1.39971      1.69971
           0.4    (1.58364)    (1.98364)    Predictor value
                  1.58364      1.98364      Corrector value
           0.5    (1.79742)    (2.29742)    Predictor value
                  1.79743      2.29743      Corrector value

Analytical value at x = 0.5 is 1.79744.

For this example, the first predictor and corrector values agree, and the next set differ
by only one in the fifth decimal place. The corrected value at x = 0.5 is in error by only
one in the fifth decimal place. This discrepancy is mostly due to round-off error in the
computed values. In this case, the value of h could have been chosen larger. With the
set of values available, h cannot be increased without additional computations, but if we
had seven equally spaced values, we could double h by taking only every other one and
still have four values to move ahead from.
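
One Milne step, predictor followed by a single correction, can be sketched as below. The function name `milne_step` and the oldest-first ordering of the stored points are assumptions of this illustration, and the starting values come from the closed form y = 2e^x - x - 1 rather than from the five-decimal Taylor-series values of Table 5.10, so the last digit may differ from the table.

```python
import math


def f(x, y):
    return x + y


def exact(x):
    """Closed-form solution of dy/dx = x + y, y(0) = 1."""
    return 2.0 * math.exp(x) - x - 1.0


def milne_step(f, xs, ys, h):
    """Predict and correct y at xs[3] + h.

    xs, ys hold the four most recent points, oldest first: n-3, n-2, n-1, n.
    Returns (predicted, corrected).
    """
    fs = [f(x, y) for x, y in zip(xs, ys)]
    x_next = xs[3] + h
    # Predictor: integrate over four panels, from x_{n-3} to x_{n+1}.
    y_pred = ys[0] + 4 * h / 3 * (2 * fs[3] - fs[2] + 2 * fs[1])
    # Corrector: Simpson's rule over two panels, from x_{n-1} to x_{n+1}.
    y_corr = ys[2] + h / 3 * (f(x_next, y_pred) + 4 * fs[3] + fs[2])
    return y_pred, y_corr


if __name__ == "__main__":
    h = 0.1
    xs = [0.0, 0.1, 0.2, 0.3]
    ys = [exact(x) for x in xs]
    pred, corr = milne_step(f, xs, ys, h)
    print(f"predicted y(0.4) = {pred:.5f}, corrected y(0.4) = {corr:.5f}")
```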

Normally, the values of $y_{n+1}$ from the predictor and the corrector do not agree.
Consideration of the error terms of Eqs. (5.11) and (5.12) suggests that the true value
should usually lie between the two values and closer to the corrector value. While $\xi_1$ and
$\xi_2$ are not necessarily the same value, they lie in similar intervals. If one assumes that
the values of $y^{(5)}(\xi_1)$ and $y^{(5)}(\xi_2)$ are equal, the error in the corrector formula is $\tfrac{1}{28}$ times
the error in the predictor formula. Hence the difference between the predictor and corrector
formula is about 29 times the error in the corrected value. This is frequently used as a
criterion of accuracy for Milne's method.* The ease with which we can monitor the error
is a particular advantage of predictor-corrector methods. We are able to know immediately
whether the step size is too big to give the desired degree of accuracy. This is in strong
contrast to the Runge-Kutta method of the previous section (but not of Runge-Kutta-
Fehlberg).

Milne's method is simple and has a good error term, $O(h^5)$ for local error. It is
subject to an instability problem in certain cases, however, in which the errors do not
tend to zero as h is made smaller. This unexpected phenomenon is discussed below.
Because of possible instability, another method, a modification of that of Adams, is more
widely used than Milne's method.


*A further criterion as to whether or not the single correction that is normally used in Milne's method is
adequate is discussed in Section 5.8.
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It is sufficient to show the instability of Milne’s method for one simple case. We 
will be able to draw the necessary conclusions from this. Consider the differential equation
dy/dx = Ay, where A is a constant. The general solution is $y = ce^{Ax}$. If $y(x_0) = y_0$ is
the initial condition that the solution must satisfy, $c = y_0 e^{-Ax_0}$. Hence, letting $y_n$ be the
value of the function when $x = x_n$, the analytical solution is

$$y_n = y_0 e^{A(x_n - x_0)}. \qquad (5.13)$$

If we solve the differential equation by the method of Milne, we have, from the
corrector formula,

$$y_{n+1} = y_{n-1} + \frac{h}{3}\left(y'_{n+1} + 4y'_n + y'_{n-1}\right).$$

Letting $y'_n = Ay_n$, from the original differential equation, and rearranging, we get

$$y_{n+1} = y_{n-1} + \frac{h}{3}\left(Ay_{n+1} + 4Ay_n + Ay_{n-1}\right),$$

$$\left(1 - \frac{hA}{3}\right)y_{n+1} - \frac{4hA}{3}y_n - \left(1 + \frac{hA}{3}\right)y_{n-1} = 0. \qquad (5.14)$$

We would like to solve this equation for $y_n$ in terms of $y_0$ to compare to Eq. (5.13).
Equation (5.14) is a second-order linear difference equation that can be solved in a manner
analogous to that for differential equations. The solution is

$$y_n = C_1 Z_1^n + C_2 Z_2^n, \qquad (5.15)$$

where $Z_1$, $Z_2$ are the roots of the quadratic

$$\left(1 - \frac{hA}{3}\right)Z^2 - \frac{4hA}{3}Z - \left(1 + \frac{hA}{3}\right) = 0. \qquad (5.16)$$

(The reader should check that Eq. (5.15) is a solution of Eq. (5.14) by direct substitution.)
For simplification, let hA/3 = r; the roots of Eq. (5.16) are then

$$Z_1 = \frac{2r + \sqrt{3r^2 + 1}}{1 - r}, \qquad Z_2 = \frac{2r - \sqrt{3r^2 + 1}}{1 - r}. \qquad (5.17)$$


We are interested in comparing the behavior of Eqs. (5.15) and (5.13) as the step
size h becomes small. As $h \to 0$, $r \to 0$, and $r^2 \to 0$ even faster. Neglecting the $3r^2$
terms in comparison to the constant 1 under the radical in Eq. (5.17) gives

$$Z_1 \approx \frac{2r + 1}{1 - r} = 1 + 3r + O(r^2) = 1 + Ah + O(h^2),$$

$$Z_2 \approx \frac{2r - 1}{1 - r} = -1 + r + O(r^2) = -\left(1 - \frac{Ah}{3}\right) + O(h^2). \qquad (5.18)$$

The last results are obtained by dividing the fractions. We now compare Eq. (5.18)
with the Maclaurin series of the exponential function,

$$e^{hA} = 1 + hA + O(h^2),$$

$$e^{-hA/3} = 1 - \frac{hA}{3} + O(h^2).$$

We see that, for $h \to 0$,

$$Z_1 \approx e^{hA}, \qquad Z_2 \approx -e^{-hA/3}.$$

Hence the Milne solution is represented by

$$y_n = C_1\left(e^{hA}\right)^n + C_2\left(-e^{-hA/3}\right)^n = C_1 e^{A(x_n - x_0)} + C_2(-1)^n e^{-A(x_n - x_0)/3}. \qquad (5.19)$$

In Eq. (5.19), we have used $x_n - x_0 = nh$. The solution consists of two parts. The
first term obviously matches with the analytical solution, Eq. (5.13). The second term,
called a parasitic term, will die out as $x_n$ increases if A is a positive constant, but if A
is negative, it will grow exponentially with $x_n$. Note that we get this peculiar behavior
independent of h; smaller step size is of no benefit in eliminating the error.

A numerical example that demonstrates the instability of Milne's method is given in
the next section. Such a demonstration by a numerical example is less conclusive than
the analytical approach just given, but is much easier to grasp.


5.7 ADAMS-MOULTON METHOD


A method that does not have the same instability problem as the Milne method, but is
about as efficient, is the Adams-Moulton method. It also assumes a set of starting values
already calculated by some other technique. Here we take a cubic through four points,
from $x_{n-3}$ to $x_n$, and integrate over one step, from $x_n$ to $x_{n+1}$. This is the same as the
method described in Section 5.5. For dy/dx = f(x, y),

$$\int_{x_n}^{x_{n+1}} \left(\frac{dy}{dx}\right) dx = \int_{x_n}^{x_{n+1}} f(x, y)\,dx \approx \int_{x_n}^{x_{n+1}} P_3(x)\,dx,$$

$$y_{n+1} - y_n = h\left(f_n + \tfrac{1}{2}\Delta f_{n-1} + \tfrac{5}{12}\Delta^2 f_{n-2} + \tfrac{3}{8}\Delta^3 f_{n-3}\right) + \frac{251}{720}h^5 y^{(5)}(\xi_1),$$

$$x_{n-3} < \xi_1 < x_{n+1}. \qquad (5.20)$$
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The integration formula can be derived by the methods of Chapter 4 or by the method 
of undetermined coefficients. Alternatively, it can be developed by writing $P_3(x)$ as a
Newton-Gregory backward interpolating polynomial (fitting at $x_n$, $x_{n-1}$, $x_{n-2}$, and $x_{n-3}$)
and integrating. In the Adams-Moulton method we continue by correcting $y_{n+1}$ before
calculating the next step. Using Eq. (5.20) as a predictor formula, we can compute a
nearly correct value of $f_{n+1}$. If we now approximate f(x, y) as a cubic that fits over the
range from $x_{n-2}$ to $x_{n+1}$, and integrate from $x_n$ to $x_{n+1}$, we will not be extrapolating the
polynomial, and will have a more favorable error term. The result is

$$y_{n+1} - y_n = h\left(f_{n+1} - \tfrac{1}{2}\Delta f_n - \tfrac{1}{12}\Delta^2 f_{n-1} - \tfrac{1}{24}\Delta^3 f_{n-2}\right) - \frac{19}{720}h^5 y^{(5)}(\xi_2),$$

$$x_{n-2} < \xi_2 < x_{n+1}. \qquad (5.21)$$


The Adams-Moulton method consists of using Eq. (5.20) as a predictor and Eq.
(5.21) as a corrector. We illustrate it with the same example as before, dy/dx = x + y,
y(0) = 1. A difference table (Table 5.11) is computed to assist us (though in a computer
program we would use a different scheme: see Eqs. (5.22)).

By the predictor formula,

$$y(0.4) = 1.39971 + 0.1\left[1.69971 + \tfrac{1}{2}(0.25691) + \tfrac{5}{12}(0.02445) + \tfrac{3}{8}(0.00233)\right] = 1.58363.$$


Table 5.11

              x      y            f            Δf           Δ²f          Δ³f

              0.0    1.00000      1.00000
                                               0.21034
              0.1    1.11034      1.21034                   0.02212
Starting                                       0.23246                   0.00233
values        0.2    1.24280      1.44280                   0.02445
                                               0.25691                   (0.00256)
              0.3    1.39971      1.69971                   (0.02701)     0.00257
                                               (0.28392)     0.02702     (0.00283)
Predicted     0.4    (1.58363)    (1.98363)     0.28393     (0.02985)
Corrected            1.58364      1.98364      (0.31378)
Predicted     0.5    (1.79742)    (2.29742)
Corrected            1.79743

Analytical value at x = 0.5 is 1.79744.
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Then f at x = 0.4 is computed and the difference table is calculated. The corrector formula 
then gives 


$$y(0.4) = 1.39971 + 0.1\left[1.98363 - \tfrac{1}{2}(0.28392) - \tfrac{1}{12}(0.02701) - \tfrac{1}{24}(0.00256)\right] = 1.58364.$$


The computations are continued in the same manner to get y(0.5). The corrected value
almost agrees to five decimals with the predicted value. Comparing error terms of Eqs.
(5.20) and (5.21), and assuming that the two fifth-derivative values are equal, we see
that the true value should lie between the predicted and corrected values, with the error
in the corrected value being about

$$\frac{19}{251 + 19} \approx \frac{1}{14}$$

times the difference between the predicted and corrected values. A frequently used criterion
for accuracy of the Adams-Moulton method with four starting values is that the corrected
value is not in error by more than 1 in the last place if the difference between predicted
and corrected values is less than 14 in the last decimal place.* If this degree of accuracy
is inadequate, we know that h is too large.

Equations (5.20) and (5.21) are not well suited to computer utilization because
calculating and storing the difference tables is wasteful of both time and memory space.
Each of these differences is a linear function of the various f-values, however. Expressing
the differences in terms of the f's gives an alternative form for the Adams-Moulton
method that is usually employed in a computer program:

$$\text{Predictor:} \quad y_{n+1} = y_n + \frac{h}{24}\left(55f_n - 59f_{n-1} + 37f_{n-2} - 9f_{n-3}\right),$$

$$\text{Corrector:} \quad y_{n+1} = y_n + \frac{h}{24}\left(9f_{n+1} + 19f_n - 5f_{n-1} + f_{n-2}\right). \qquad (5.22)$$

These same formulas are derived directly by the method of undetermined coefficients as
Eqs. (B.28) and (B.29) of Appendix B.
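
The f-value form lends itself to a compact program. The sketch below applies the predictor and a single correction per step to dy/dx = x + y, y(0) = 1 with h = 0.1, and also reports the predictor-corrector difference divided by 14 as a rough error indicator. The function name, the list-based storage, and the use of the closed form y = 2e^x - x - 1 for the four starting values are assumptions made for this illustration.

```python
import math


def f(x, y):
    return x + y


def exact(x):
    """Closed-form solution of dy/dx = x + y, y(0) = 1 (used for starting values)."""
    return 2.0 * math.exp(x) - x - 1.0


def adams_moulton(f, xs, ys, h, n_new):
    """Extend four starting points by n_new predictor-corrector steps.

    Returns the extended y list and, for each new step, |predicted - corrected| / 14.
    """
    xs, ys = list(xs), list(ys)
    fs = [f(x, y) for x, y in zip(xs, ys)]
    indicators = []
    for _ in range(n_new):
        x_next = xs[-1] + h
        pred = ys[-1] + h / 24 * (55 * fs[-1] - 59 * fs[-2] + 37 * fs[-3] - 9 * fs[-4])
        corr = ys[-1] + h / 24 * (9 * f(x_next, pred) + 19 * fs[-1] - 5 * fs[-2] + fs[-3])
        indicators.append(abs(pred - corr) / 14)
        xs.append(x_next)
        ys.append(corr)
        fs.append(f(x_next, corr))
    return ys, indicators


if __name__ == "__main__":
    h = 0.1
    start_x = [0.0, 0.1, 0.2, 0.3]
    start_y = [exact(x) for x in start_x]
    ys, indicators = adams_moulton(f, start_x, start_y, h, 2)
    print(f"y(0.4) = {ys[4]:.5f}, y(0.5) = {ys[5]:.5f}, exact y(0.5) = {exact(0.5):.5f}")
```

Only two new function evaluations occur per step (one at the predicted point and one at the corrected point); the remaining f-values are reused from storage.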

These equations are better suited to machine calculation and to digital computer
programming. Without such calculation aids, the large coefficients and alternating signs
lead to large round-off errors unless extra guard figures are carried.

Adams-Moulton formulas employing more or less than four starting values can be
derived in analogous fashion. The fourth-order formulas we have given are widely used,
especially in combination with Runge-Kutta, because both kinds of methods then have
local errors of $O(h^5)$.


*The convergence criterion of Section 5.8 should also be met. 
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When the predicted and corrected values agree to as many decimals as the desired 
accuracy, we can save computational effort by increasing the step size. As mentioned in
connection with Milne's method, we can conveniently double the step size, after we have
seven equispaced values, by omitting every second one. When the difference between
predicted and corrected values reaches or exceeds the accuracy criterion, we should
decrease step size. If we interpolate two additional y-values with a fourth-degree
polynomial, where the error will be $O(h^5)$, consistent with the rest of our work, we can readily
halve the step size.* Convenient formulas for this are

$$y_{n-1/2} = \tfrac{1}{128}\left[35y_n + 140y_{n-1} - 70y_{n-2} + 28y_{n-3} - 5y_{n-4}\right],$$

$$y_{n-3/2} = \tfrac{1}{64}\left[-y_n + 24y_{n-1} + 54y_{n-2} - 16y_{n-3} + 3y_{n-4}\right]. \qquad (5.23)$$

Use of these values with $y_n$, $y_{n-1}$ gives four values of the function at intervals of
$\Delta x = h/2$.
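
Interpolation weights of this kind can be checked with exact rational arithmetic. The sketch below, whose names (`lagrange_weights`, `NODES`) are illustrative, evaluates the Lagrange basis polynomials through the five points $y_n, \ldots, y_{n-4}$ (placed at s = 0, -1, ..., -4 in units of h) at a chosen intermediate point. At s = -1/2 it reproduces the coefficients of the first formula. At s = -3/2 it returns the pure quartic-interpolation weights (-5, 60, 90, -20, 3)/128, which can be set beside the second formula as a check on the printed coefficients.

```python
from fractions import Fraction


def lagrange_weights(nodes, s):
    """Weights w_i such that p(s) = sum(w_i * y_i) for the polynomial through all nodes."""
    weights = []
    for i, xi in enumerate(nodes):
        w = Fraction(1)
        for j, xj in enumerate(nodes):
            if j != i:
                w *= (s - xj) / (xi - xj)
        weights.append(w)
    return weights


# y_n, y_{n-1}, ..., y_{n-4} sit at s = 0, -1, ..., -4 (in units of h).
NODES = [Fraction(-k) for k in range(5)]

w_half = lagrange_weights(NODES, Fraction(-1, 2))           # for y_{n-1/2}
w_three_halves = lagrange_weights(NODES, Fraction(-3, 2))   # for y_{n-3/2}

if __name__ == "__main__":
    print("128 * weights at n-1/2:", [int(w * 128) for w in w_half])
    print("128 * weights at n-3/2:", [int(w * 128) for w in w_three_halves])
```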

The efficiency of both the Adams—Moulton and Milne methods is about twice that 
of the Runge—Kutta—Fehlberg and Runge—Kutta methods. Only two function evaluations 
are needed per step for the former methods, while six or four are required with the single- 
step alternatives. All have similar error terms. Change of step size with the multistep 
methods is considerably more awkward, however. 

We have stressed that the advantage of the Adams-Moulton method over that of
Milne is that it is a stable method rather than an unstable one. An analytical proof of the
stability of Adams-Moulton for the equation y' = Ay is similar to the analysis of Section
5.6: an equation results with parasitic terms that do die out as h gets small regardless
of the sign of A. Such a treatment is not entirely satisfying because we can't prove stability
by examining only one case (proving that an assertion is not true is always easier because
we need to find only one counterexample).

It is much clearer to compare the stability of Adams-Moulton with Milne through
a numerical example. The table in Fig. 5.2 presents results from a computer program
that solved y' = -10y, y(0) = 1 (for which the analytical solution is $y = e^{-10x}$) over
the interval x = 0 to x = 2. The first part of the table is computed with h = 0.04. In
the second part, for which only partial output is given, h = 0.004.

Observe that the results with the Milne method grow to have very large relative
errors. In fact, near x = 2.0, the relative error is practically 100%. The oscillatory
behavior of the solution is also characteristic of instability. Even with the smaller h, the
Milne method blows up. The errors are even greater near x = 2.0 in spite of the smaller
step size. (The Milne solution when x is small is considerably more accurate when h =
0.004. Results are not shown for this.)

The results by the Adams-Moulton method do not show this anomalous behavior.
The relative error, while growing to some degree, still stays manageable (8% at $x = 2.0$
with $h = 0.04$, < 0.1% at $x = 2.0$ with $h = 0.004$). The expected decrease of error
with decrease in step size is realized. The oscillation of values we notice in the results
by Milne is absent. In sum, we conclude that the Adams-Moulton method gives good
results, particularly at the smaller value of $h$, while the results of the Milne method are
hopeless.

*An alternative but computationally more expensive way to get intermediate points would be to compute them
with Runge-Kutta formulas.

Figure 5.2  Comparison of relative error in results obtained by the Milne and Adams-Moulton methods.

                  Solution Using Milne Method       Solution Using Adams-Moulton Method
x                 y               Error             y               Error

Values with h = 0.04
0.000000E 00      0.100000E 01    0.000000E 00      0.100000E 01    0.000000E 00
0.400000E-01      0.670320E 00    0.000000E 00      0.670320E 00    0.000000E 00
0.800000E-01      0.449329E 00    0.000000E 00      0.449329E 00    0.000000E 00
0.120000E 00      0.301195E 00    0.000000E 00      0.301195E 00    0.000000E 00
0.160000E 00      0.201667E 00    0.229836E-03      0.201552E 00    0.344634E-03
0.200000E 00      0.135271E 00    0.640750E-04      0.134873E 00    0.462234E-03
0.240000E 00      0.904572E-01    0.260949E-03      0.902753E-01    0.442863E-03
0.280000E 00      0.607594E-01    0.507981E-04      0.604129E-01    0.397373E-03
0.320000E 00      0.405497E-01    0.212613E-03      0.404273E-01    0.335068E-03
0.360000E 00      0.273069E-01    0.169165E-04      0.270547E-01    0.269119E-03
0.400000E 00      0.181599E-01    0.155795E-03      0.181052E-01    0.210475E-03
0.440000E 00      0.122874E-01   -0.100434E-04      0.121160E-01    0.161357E-03
0.480000E 00      0.811849E-02    0.111297E-03      0.810813E-02    0.121657E-03
0.520000E 00      0.554193E-02   -0.253431E-04      0.542601E-02    0.905804E-04
0.560000E 00      0.361742E-02    0.804588E-04      0.363111E-02    0.667726E-04
0.599999E 00      0.251041E-02   -0.316396E-04      0.242996E-02    0.488090E-04
0.639999E 00      0.160174E-02    0.598307E-04      0.162614E-02    0.354277E-04
0.679999E 00      0.114626E-02   -0.324817E-04      0.108822E-02    0.255615E-04
0.719999E 00      0.700667E-03    0.459237E-04      0.728243E-03    0.183482E-04
0.759999E 00      0.530960E-03   -0.305050E-04      0.487344E-03    0.131114E-04
0.799999E 00      0.299215E-03    0.362506E-04      0.326133E-03    0.933232E-05
0.839999E 00      0.252197E-03   -0.273281E-04      0.218250E-03    0.661935E-05
0.879999E 00      0.121497E-03    0.292376E-04      0.146054E-03    0.468052E-05
0.919999E 00      0.124879E-03   -0.238383E-04      0.977399E-04    0.330036E-05
0.959999E 00      0.437888E-04    0.239405E-04      0.654081E-04    0.232129E-05
0.999999E 00      0.658748E-04   -0.204745E-04      0.437714E-04    0.162897E-05
0.104000E 01      0.106330E-04    0.197998E-04      0.292920E-04    0.114074E-05
0.108000E 01      0.378257E-04   -0.174260E-04      0.196024E-04    0.797314E-06
0.112000E 01     -0.280455E-05    0.164789E-04      0.131180E-04    0.556327E-06
0.116000E 01      0.239183E-04   -0.147521E-04      0.877864E-05    0.387549E-06
0.120000E 01     -0.762395E-05    0.137682E-04      0.587471E-05    0.269573E-06
0.124000E 01      0.165678E-04   -0.124492E-04      0.393138E-05    0.187252E-06
0.128000E 01     -0.876950E-05    0.115303E-04      0.263090E-05    0.129902E-06
0.132000E 01      0.123371E-04   -0.104865E-04      0.176061E-05    0.900100E-07
0.136000E 01     -0.842895E-05    0.966946E-05      0.117821E-05    0.622986E-07
0.140000E 01      0.965523E-05   -0.882369E-05      0.788466E-06    0.430742E-07
0.144000E 01     -0.755810E-05    0.811550E-05      0.527645E-06    0.297529E-07
0.148000E 01      0.779361E-05   -0.741997E-05      0.353103E-06    0.205325E-07
0.152000E 01     -0.656401E-05    0.681446E-05      0.236298E-06    0.141571E-07
0.156000E 01      0.640522E-05   -0.623733E-05      0.158132E-06    0.975331E-08
0.160000E 01     -0.561102E-05    0.572355E-05      0.105823E-06    0.671417E-08
0.164000E 01      0.531755E-05   -0.524211E-05      0.708171E-07    0.461910E-08
0.168000E 01     -0.475746E-05    0.480803E-05      0.473911E-07    0.317556E-08
0.172000E 01      0.443906E-05   -0.440517E-05      0.317144E-07    0.218119E-08
0.176000E 01     -0.401659E-05    0.403931E-05      0.212234E-07    0.149759E-08
0.180000E 01      0.371683E-05   -0.370160E-05      0.142028E-07    0.102763E-08
0.184000E 01     -0.338345E-05    0.339366E-05      0.950460E-08    0.704578E-09
0.188000E 01      0.311713E-05   -0.311028E-05      0.636053E-08    0.482931E-09
0.192000E 01     -0.284671E-05    0.285129E-05      0.425650E-08    0.330768E-09
0.196000E 01      0.261645E-05   -0.261337E-05      0.284847E-08    0.226487E-09
0.200000E 01     -0.239358E-05    0.239564E-05      0.190621E-08    0.155008E-09

Values near x = 2.0 with h = 0.004
0.191192E 01     -0.734619E-05    0.735117E-05      0.496895E-08    0.425615E-11
0.191592E 01      0.744385E-05   -0.743908E-05      0.477411E-08    0.405009E-11
0.191992E 01     -0.752346E-05    0.752805E-05      0.458691E-08    0.392220E-11
0.192392E 01      0.762249E-05   -0.761808E-05      0.440706E-08    0.379785E-11
0.192792E 01     -0.770495E-05    0.770919E-05      0.423425E-08    0.368061E-11
0.193192E 01      0.780545E-05   -0.780138E-05      0.406822E-08    0.350298E-11
0.193592E 01     -0.789078E-05    0.789469E-05      0.390870E-08    0.339284E-11
0.193992E 01      0.799286E-05   -0.798910E-05      0.375544E-08    0.328626E-11
0.194392E 01     -0.808104E-05    0.808465E-05      0.360819E-08    0.318456E-11
0.194792E 01      0.818480E-05   -0.818133E-05      0.346671E-08    0.303002E-11
0.195192E 01     -0.827585E-05    0.827918E-05      0.333078E-08    0.293365E-11
0.195592E 01      0.838139E-05   -0.837819E-05      0.320017E-08    0.284017E-11
0.195992E 01     -0.847532E-05    0.847839E-05      0.307469E-08    0.274958E-11
0.196392E 01      0.858274E-05   -0.857979E-05      0.295413E-08    0.261657E-11
0.196792E 01     -0.867956E-05    0.868240E-05      0.283830E-08    0.253331E-11
0.197192E 01      0.878896E-05   -0.878623E-05      0.272701E-08    0.245226E-11
0.197592E 01     -0.888869E-05    0.889132E-05      0.262008E-08    0.237388E-11
0.197992E 01      0.900017E-05   -0.899765E-05      0.251735E-08    0.225930E-11
0.198392E 01     -0.910284E-05    0.910526E-05      0.241864E-08    0.218714E-11
0.198792E 01      0.921648E-05   -0.921415E-05      0.232380E-08    0.211720E-11
0.199192E 01     -0.932212E-05    0.932435E-05      0.223268E-08    0.204925E-11
0.199592E 01      0.943801E-05   -0.943586E-05      0.214514E-08    0.195066E-11
0.199992E 01     -0.954665E-05    0.954871E-05      0.206103E-08    0.188805E-11

It is worth remarking that we usually don’t have analytical answers to compare with 
our numerical results. Observing oscillatory behavior in itself does not mean instability, 
because the correct solution may be oscillatory. For this reason, the usual practice in 
numerical analysis is to entirely avoid methods that are sometimes unstable, even though 
they might be more accurate in some instances. 
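The comparison in Fig. 5.2 can be repeated with a few lines of code. The sketch below assumes the usual fourth-order formulas for each method (Milne predictor with the Simpson corrector, Adams-Bashforth predictor with the Adams-Moulton corrector), exact starting values, and one correction per step with no recorrection. The function names `milne` and `adams_moulton` are illustrative only.

```python
import math

def f(x, y):
    """Right-hand side of the test equation y' = -10y."""
    return -10.0 * y

def milne(h, n_steps):
    """Milne predictor plus Simpson corrector, one correction per step."""
    ys = [math.exp(-10.0 * i * h) for i in range(4)]   # exact starting values
    fs = [f(i * h, y) for i, y in enumerate(ys)]
    for n in range(3, n_steps):
        yp = ys[n - 3] + 4.0 * h / 3.0 * (2.0 * fs[n] - fs[n - 1] + 2.0 * fs[n - 2])
        yc = ys[n - 1] + h / 3.0 * (f((n + 1) * h, yp) + 4.0 * fs[n] + fs[n - 1])
        ys.append(yc)
        fs.append(f((n + 1) * h, yc))
    return ys

def adams_moulton(h, n_steps):
    """Adams-Bashforth predictor plus Adams-Moulton corrector, one correction per step."""
    ys = [math.exp(-10.0 * i * h) for i in range(4)]   # exact starting values
    fs = [f(i * h, y) for i, y in enumerate(ys)]
    for n in range(3, n_steps):
        yp = ys[n] + h / 24.0 * (55.0 * fs[n] - 59.0 * fs[n - 1]
                                 + 37.0 * fs[n - 2] - 9.0 * fs[n - 3])
        yc = ys[n] + h / 24.0 * (9.0 * f((n + 1) * h, yp) + 19.0 * fs[n]
                                 - 5.0 * fs[n - 1] + fs[n - 2])
        ys.append(yc)
        fs.append(f((n + 1) * h, yc))
    return ys

h, n_steps = 0.04, 50                     # 50 steps of 0.04 reach x = 2.0
exact = math.exp(-10.0 * n_steps * h)
for name, method in (("Milne", milne), ("Adams-Moulton", adams_moulton)):
    y_end = method(h, n_steps)[-1]
    print(f"{name:14s} y(2.0) = {y_end: .6e}   relative error = {abs(y_end - exact) / exact:.3g}")
```

The Milne result at $x = 2.0$ is dominated by the oscillating parasitic term, while the Adams-Moulton result keeps the correct sign and order of magnitude.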


CONVERGENCE CRITERIA 


In Section 5.3, we recorrected in the modified Euler method until no further change in
$y_{n+1}$ resulted. Usually this requires one more calculation than would otherwise be needed
if we could predict whether the recorrection would make a significant change. In the
methods of Milne and Adams-Moulton, we usually do not recorrect, but use a value of
$h$ small enough that this is unnecessary. We now look for a criterion to show how small
$h$ should be in the Adams-Moulton method, for $dy/dx = f(x, y)$, so that recorrections
are not necessary. Let

$y_p$ = value of $y_{n+1}$ from predictor formula,
$y_c$ = value of $y_{n+1}$ from corrector formula,
$y_{cc}, y_{ccc}, \ldots$ = values of $y_{n+1}$ if successive recorrections are made,
$y_\infty$ = value to which successive recorrections converge.


The change of $y_c$ by recorrecting would be

$$y_{cc} - y_c = \left(y_n + \frac{h}{24}\left(9y'_c + 19y'_n - 5y'_{n-1} + y'_{n-2}\right)\right) - \left(y_n + \frac{h}{24}\left(9y'_p + 19y'_n - 5y'_{n-1} + y'_{n-2}\right)\right) = \frac{9h}{24}\left(y'_c - y'_p\right). \tag{5.24}$$

In Eq. (5.24) we have used the subscript p or c to denote which y-value is used in
evaluating the derivative at $x = x_{n+1}$. We now manipulate the difference $(y'_c - y'_p)$:

$$y'_c - y'_p = f(x_{n+1}, y_c) - f(x_{n+1}, y_p) = \frac{f(x_{n+1}, y_c) - f(x_{n+1}, y_p)}{y_c - y_p}\,(y_c - y_p) = f_y(\xi_1)D,$$

with $\xi_1$ between $y_c$ and $y_p$, and $D = y_c - y_p$. Hence,

$$y_{cc} - y_c = \frac{9hD}{24} f_y(\xi_1)$$

is the difference on recorrecting. If recorrected again, the result is

$$\begin{aligned}
y_{ccc} - y_{cc} &= \frac{9h}{24}\left(y'_{cc} - y'_c\right) \\
&= \frac{9h}{24} f_y(\xi_2)\,(y_{cc} - y_c) \\
&= \frac{9h}{24} f_y(\xi_2)\left[\frac{9hD}{24} f_y(\xi_1)\right] \\
&= \left(\frac{9h}{24}\right)^2 [f_y(\xi)]^2 D, \qquad \xi \text{ between } y_p \text{ and } y_{cc}.
\end{aligned} \tag{5.25}$$

In Eq. (5.25), we need to impose the restrictions that $f_y$ be continuous and the $f_y(\xi_i)$ have
the same sign; $\xi$ lies between the extremes of $y_p$, $y_c$, and $y_{cc}$.


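On further recorrections we will have a similar relation. We get $y_\infty$ by adding all the
corrections of $y_p$ together:

$$\begin{aligned}
y_\infty &= y_p + (y_c - y_p) + (y_{cc} - y_c) + (y_{ccc} - y_{cc}) + \cdots \\
&= y_p + D + \frac{9hf_y(\xi)}{24}D + \left(\frac{9hf_y(\xi)}{24}\right)^2 D + \left(\frac{9hf_y(\xi)}{24}\right)^3 D + \cdots.
\end{aligned}$$

The increment to $y_p$ is a geometric series; so, if the ratio is less than unity, which is
necessary if a geometric series is to have a sum,

$$y_\infty = y_p + \frac{D}{1 - r}, \qquad r = \frac{hf_y(\xi)}{24/9}, \qquad \xi \text{ between } y_p \text{ and } y_\infty.$$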


Hence, unless

$$|r| = \frac{h|f_y(\xi)|}{24/9} \approx \frac{h|f_y(x_n, y_n)|}{24/9} < 1,$$

the successive recorrections diverge. Our first convergence criterion is

$$h < \frac{24/9}{|f_y(x_n, y_n)|}. \tag{5.26}$$


If we wish to have $y_c$ and $y_\infty$ the same to within one in the Nth decimal place, then

$$y_\infty - y_c = \left(y_p + \frac{D}{1 - r}\right) - (y_p + D) = \frac{rD}{1 - r} < 10^{-N}.$$

If $r \ll 1$, the fraction $r/(1 - r)$ is nearly equal to $r$,
and a second convergence criterion, which ensures that the first corrected value is adequate
(that is, it will not be changed in the Nth decimal place by further corrections) is

$$D \cdot 10^N < \frac{1}{|r|} = \frac{24/9}{h|f_y(x_n, y_n)|}. \tag{5.27}$$

For the Adams-Moulton method we have the three criteria given below. If all are
met, the corrected value should be good to N decimals:

Convergence criteria: $h < \dfrac{24/9}{|f_y|}$, $\quad D \cdot 10^N < \dfrac{24/9}{h|f_y|}$.

Accuracy criterion: $D \cdot 10^N < 14.2$.

Similar criteria for the Milne method are derived in the same way. They are

Convergence criteria: $h < \dfrac{3}{|f_y|}$, $\quad D \cdot 10^N < \dfrac{3}{h|f_y|}$.

Accuracy criterion: $D \cdot 10^N < 29$. (5.28)

These criteria are for a single first-order equation only. A similar analysis for a system 
is much more complicated. 

We illustrate the use of these criteria with an example. Given the equation
$dy/dx = \sin x - 3y$, $y(0) = 1$. In the neighborhood of the point (1, 0.3), what maximum
value of $h$ is permitted if we wish to compute by (1) Milne's method, (2) the Adams-Moulton
method and get accuracy to five decimals? How close must the predictor and
corrector values be?

1. $f_y = -3$, so $h < \dfrac{3}{|-3|} = 1$
to ensure convergence if we were to recorrect $y_c$ in the Milne method. This requirement
is not severe; we certainly would choose $\Delta x$ smaller than this to give information
throughout the range of x values from 0 to 1. Suppose $h$ were taken as 0.2.
Then

$$D < \frac{3}{h|f_y| \cdot 10^N} = \frac{3}{0.2\,|-3| \cdot 10^5} = 5.0 \times 10^{-5}.$$

The difference between $y_p$ and $y_c$ cannot exceed this value; if it does, recorrections
will be needed. This is more severe than $D < 29 \times 10^{-5}$ required for accuracy to
one in the fifth decimal for $y_c$. We should monitor a Milne program in order to be
sure that the difference between $y_p$ and $y_c$ does not exceed $5 \times 10^{-5}$. If it should
exceed this value, we would need to reduce $h$ so that this criterion could be met.

2. For Adams-Moulton, we calculate

$$h < \frac{24/9}{|-3|} = 0.89.$$

If $h = 0.2$,

$$D < \frac{24/9}{0.2\,|-3| \cdot 10^5} \approx 4.4 \times 10^{-5}.$$

Again, this limit on the difference between $y_p$ and $y_c$ is more severe than $D < 14.2 \times 10^{-5}$ and should
be used to control the value of $h$.
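The arithmetic of this example is easy to automate. The following is a small sketch, not part of any library: the function name `recorrection_limits` and its arguments are chosen here for illustration, and the constants 24/9, 3, 14.2, and 29 are the ones appearing in the criteria above.

```python
def recorrection_limits(fy, h, n_decimals, method="adams-moulton"):
    """Return (h_max, D_convergence, D_accuracy) for a single first-order equation.

    fy         : value of the partial derivative f_y near the point of interest
    h          : step size actually used
    n_decimals : N, the number of decimal places wanted
    """
    if method == "adams-moulton":
        conv_const, acc_const = 24.0 / 9.0, 14.2
    elif method == "milne":
        conv_const, acc_const = 3.0, 29.0
    else:
        raise ValueError("method must be 'adams-moulton' or 'milne'")
    scale = 10.0 ** (-n_decimals)
    h_max = conv_const / abs(fy)                   # first convergence criterion
    d_conv = conv_const / (h * abs(fy)) * scale    # second convergence criterion
    d_acc = acc_const * scale                      # accuracy criterion
    return h_max, d_conv, d_acc

# dy/dx = sin x - 3y, so f_y = -3; five decimals wanted, h = 0.2
for name in ("milne", "adams-moulton"):
    h_max, d_conv, d_acc = recorrection_limits(-3.0, 0.2, 5, name)
    print(f"{name:14s} h < {h_max:.2f}   D < {d_conv:.1e} (convergence)   D < {d_acc:.1e} (accuracy)")
```

The smaller of the two bounds on $D$ is the one to monitor while the solution is advanced.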


5.9 ERRORS AND ERROR PROPAGATION 


Our previous error analyses have examined the error of a single step only, the so-called 
local truncation error of the methods. Since all practical applications of numerical methods 
to differential equations involve many steps, the accumulation of these errors, termed the 
global truncation error, is important. We remember that there are several sources of error 
in a numerical calculation in addition to the truncation error. 


ORIGINAL DATA ERRORS 


If the initial conditions are not known exactly (or must be expressed inexactly as a 
terminated decimal number), the solution will be affected to a greater or lesser degree 
depending on the sensitivity of the equation. Highly sensitive equations are said to be 
subject to inherent instability. 


ROUND-OFF ERRORS 


Since we can carry only a finite number of decimal places, our computations are subject 
to inaccuracy from this source, no matter whether we round or whether we chop off. 
Carrying more decimal places in the intermediate calculations than we require in the final 
answer is the normal practice to minimize this, but in lengthy calculations this is a source 
of error that is extremely difficult to analyze and control. Furthermore, in a computer 
program, if we use double precision, we require a longer execution time and also more 
storage to hold the more precise values. If these values are for a large array, the memory 
space needed may exceed that available to the program. This type of error is especially 
acute when two nearly equal quantities are subtracted. Both floating-point and fixed-point
calculations in computers are subject to round-off errors.

TRUNCATION ERRORS OF THE METHOD


These are the types of error we have been discussing, because we use truncated series 
for approximation in our work, when an infinite series is needed for exactness. The choice 
of method is our best control here, with suitable selection of h. 

In addition to these three types of error, when we solve differential equations numer- 
ically we must worry about the propagation of previous errors through the subsequent 
steps. Since we use the end values at each step as the starting values for the next one, 
it is as if incorrect original data were distorting the later values. (Round-off would almost 
always produce error even if our method were exact.) This effect we now examine, but 
only for the very simple case of the Euler method. This will show how such error studies 
are made, as well as suggesting how difficult the analysis of more practical methods is. 

We consider the first-order equation $dy/dx = f(x, y)$, $y(x_0) = y_0$. Let

$Y_n$ = calculated value at $x_n$,
$y_n$ = true value at $x_n$,
$e_n = y_n - Y_n$ = error in $Y_n$.

By the Euler algorithm,

$$Y_{n+1} = Y_n + hf(x_n, Y_n).$$

By Taylor series,

$$y_{n+1} = y_n + hf(x_n, y_n) + \frac{h^2}{2}y''(\xi_n), \qquad x_n < \xi_n < x_n + h,$$

$$\begin{aligned}
e_{n+1} = y_{n+1} - Y_{n+1} &= y_n - Y_n + h\left[f(x_n, y_n) - f(x_n, Y_n)\right] + \frac{h^2}{2}y''(\xi_n) \\
&= e_n + h\,\frac{f(x_n, y_n) - f(x_n, Y_n)}{y_n - Y_n}\,(y_n - Y_n) + \frac{h^2}{2}y''(\xi_n) \\
&= e_n + hf_y(x_n, \eta_n)\,e_n + \frac{h^2}{2}y''(\xi_n), \qquad \eta_n \text{ between } y_n, Y_n.
\end{aligned} \tag{5.29}$$

In Eq. (5.29) we have used the mean-value theorem, imposing continuity and existence
conditions on $f(x, y)$ and $f_y$. We suppose, in addition, that the magnitude of $f_y$ is
bounded by the positive constant $K$ in the region of x, y-space in which we are interested.*
Hence,

$$e_{n+1} \le (1 + hK)e_n + \frac{1}{2}h^2 y''(\xi_n). \tag{5.30}$$

Here $y(x_0) = y_0$ is our initial condition, which we assume free of error. Since $Y_0 = y_0$,
$e_0 = 0$:

$$e_1 \le (1 + hK)e_0 + \frac{1}{2}h^2 y''(\xi_0) = \frac{1}{2}h^2 y''(\xi_0),$$

$$e_2 \le (1 + hK)\left[\frac{1}{2}h^2 y''(\xi_0)\right] + \frac{1}{2}h^2 y''(\xi_1) = \frac{1}{2}h^2\left[(1 + hK)y''(\xi_0) + y''(\xi_1)\right].$$

Similarly,

$$e_3 \le \frac{1}{2}h^2\left[(1 + hK)^2 y''(\xi_0) + (1 + hK)y''(\xi_1) + y''(\xi_2)\right],$$

$$e_n \le \frac{1}{2}h^2\left[(1 + hK)^{n-1} y''(\xi_0) + (1 + hK)^{n-2} y''(\xi_1) + \cdots + y''(\xi_{n-1})\right].$$

If $f_y \le K$ is positive, the truncation error at every step is propagated to every later
step after being amplified by the factor $(1 + hf_y)$ each time. Note that as $h \to 0$, the
error at any point is just the sum of all the previous errors. If the $f_y$ are negative and of
magnitude such that $|hf_y| < 2$, the errors are propagated with diminishing effect.

*This is essentially the same as the Lipschitz condition, which will guarantee existence and uniqueness of a
solution.

We now show that the accumulated error after n steps is $O(h)$; that is, the global
error of the simple Euler method is $O(h)$. We assume, in addition to the above, that $y''$
is bounded, $|y''(x)| < M$, $M > 0$. Equation (5.30) becomes, after taking absolute values,

$$|e_{n+1}| \le (1 + hK)|e_n| + \frac{1}{2}h^2 M.$$

Compare to the first-order difference equation

$$Z_{n+1} = (1 + hK)Z_n + \frac{1}{2}h^2 M, \qquad Z_0 = 0. \tag{5.31a}$$

Obviously the values of $Z_n$ are at least equal to the magnitudes of $|e_n|$. The solution to
Eq. (5.31a) is (check by direct substitution)

$$Z_n = \frac{hM}{2K}(1 + hK)^n - \frac{hM}{2K}.$$

The Maclaurin expansion of $e^{hK}$ is

$$e^{hK} = 1 + hK + \frac{(hK)^2}{2} + \frac{(hK)^3}{6} + \cdots,$$

so that

$$1 + hK < e^{hK} \quad (K > 0),$$

$$Z_n < \frac{hM}{2K}\left(e^{hK}\right)^n - \frac{hM}{2K} = \frac{hM}{2K}\left(e^{nhK} - 1\right) = \frac{hM}{2K}\left(e^{(x_n - x_0)K} - 1\right) = O(h). \tag{5.31}$$

It follows that the global error $e_n$ is $O(h)$. (This result can be derived without difference
equations. See Exercise 42.)
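A short numerical check makes the bound (5.31) concrete. The sketch below assumes the test problem $y' = y$, $y(0) = 1$ on $0 \le x \le 1$, which is not taken from the text; for it we may take $K = 1$ and $M = e$. The names `euler_final` and `global_error_bound` are illustrative.

```python
import math

def euler_final(f, x0, y0, h, n_steps):
    """Advance y' = f(x, y) by n_steps simple Euler steps; return the last y."""
    x, y = x0, y0
    for _ in range(n_steps):
        y += h * f(x, y)
        x += h
    return y

def global_error_bound(h, K, M, span):
    """Right-hand side of Eq. (5.31): (hM / 2K)(exp(span * K) - 1)."""
    return h * M / (2.0 * K) * (math.exp(span * K) - 1.0)

errors = {}
for n in (10, 20, 40):
    h = 1.0 / n
    errors[n] = abs(math.e - euler_final(lambda x, y: y, 0.0, 1.0, h, n))
    print(f"h = {h:.3f}   error = {errors[n]:.5f}   bound = {global_error_bound(h, 1.0, math.e, 1.0):.5f}")
```

Each halving of $h$ roughly halves the error, as expected of an $O(h)$ method, and the observed error stays below the bound.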
5.10 SYSTEMS OF EQUATIONS AND HIGHER-ORDER EQUATIONS


We have so far treated only the case of a first-order differential equation. Most differential
equations that are the mathematical model for a physical problem are of higher order, or
even a set of simultaneous higher-order differential equations. For example,

$$\frac{w}{g}\frac{d^2x}{dt^2} + b\frac{dx}{dt} + kx = f(x, t)$$

represents a vibrating system in which a linear spring with spring constant $k$ restores a
displaced mass of weight $w$ against a resisting force whose resistance is $b$ times the
velocity. The function $f(x, t)$ is an external forcing function acting on the mass.

An analogous second-order equation describes the flow of electricity in a circuit 
containing inductance, capacitance, and resistance. The external forcing function in this 
case represents the applied electromotive force. Compound spring—mass systems and 
electrical networks can be simulated by a system of such second-order equations.

We first show how a higher-order differential equation can be reduced to a system 
of simultaneous first-order equations. We then show that these can be solved by an 
application of the methods previously studied. We treat here initial-value problems only, 
for which n values of the functions or derivatives (with n equal to the order of the system) 
are all specified at the same (initial) value of the independent variable. When some of 
the conditions are specified at one value of the independent variable and others at a second 
value, we call it a boundary-value problem. Methods of solving these are discussed in 
the next chapter. 

By solving for the second derivative, we can normally express a second-order equation
as

$$\frac{d^2x}{dt^2} = f\left(t, x, \frac{dx}{dt}\right), \qquad x(t_0) = x_0, \quad x'(t_0) = x'_0. \tag{5.32}$$

The initial value of the function $x$ and its derivative are generally specified. We
convert this to a pair of first-order equations by the simple expedient of defining the
derivative as a second function. Then, since $d^2x/dt^2 = (d/dt)(dx/dt)$,

$$\frac{dx}{dt} = y, \qquad x(t_0) = x_0,$$

$$\frac{dy}{dt} = f(t, x, y), \qquad y(t_0) = x'_0.$$


This pair of first-order equations is equivalent to the original Eq. (5.32). For even 
higher orders, each of the lower derivatives is defined as a new function, giving a set of 
n first-order equations that correspond to an nth-order differential equation. For a system 
of higher-order equations, each is similarly converted, so that a larger set of first-order 
equations results. Thus the nth-order differential equation 


$$y^{(n)} = f\left(x, y, y', \ldots, y^{(n-1)}\right),$$

$$y(x_0) = A_1, \quad y'(x_0) = A_2, \quad \ldots, \quad y^{(n-1)}(x_0) = A_n,$$

is converted into a system of n first-order differential equations by letting $y_1 = y$ and

$$\begin{aligned}
y'_1 &= y_2, \\
y'_2 &= y_3, \\
&\;\;\vdots \\
y'_{n-1} &= y_n, \\
y'_n &= f(x, y_1, y_2, \ldots, y_n),
\end{aligned}$$

with initial conditions

$$y_1(x_0) = A_1, \quad y_2(x_0) = A_2, \quad \ldots, \quad y_n(x_0) = A_n.$$
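The substitution is mechanical enough to express as a small helper. In the sketch below, `to_first_order_system` is a hypothetical name, and the state list `Y` holds $[y_1, y_2, \ldots, y_n] = [y, y', \ldots, y^{(n-1)}]$.

```python
def to_first_order_system(f):
    """Wrap y^(n) = f(x, y, y', ..., y^(n-1)) as Y' = F(x, Y) with Y = [y_1, ..., y_n]."""
    def rhs(x, Y):
        # y_1' = y_2, ..., y_{n-1}' = y_n, and y_n' = f(x, y_1, ..., y_n)
        return list(Y[1:]) + [f(x, *Y)]
    return rhs

# Second-order example: y'' = -y with y(0) = 0, y'(0) = 1
oscillator = to_first_order_system(lambda x, y, yp: -y)
print(oscillator(0.0, [0.0, 1.0]))

# Third-order example: y''' = x + y - 2y' + 3y''
third = to_first_order_system(lambda x, y, yp, ypp: x + y - 2.0 * yp + 3.0 * ypp)
print(third(1.0, [1.0, 2.0, 3.0]))
```

The returned function can be handed to any of the single-step or multistep methods once they are written to work on lists of values.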


We now illustrate the application of the various methods to the pair of first-order
equations

$$\frac{dx}{dt} = xy + t, \qquad x(0) = 1,$$

$$\frac{dy}{dt} = ty + x, \qquad y(0) = -1. \tag{5.33}$$


TAYLOR-SERIES METHOD

We need the various derivatives $x', x'', x''', \ldots$, $y', y'', y''', \ldots$, all evaluated at
$t = 0$:

$$\begin{aligned}
x' &= xy + t, & x'(0) &= (1)(-1) + 0 = -1, \\
y' &= ty + x, & y'(0) &= (0)(-1) + 1 = 1, \\
x'' &= xy' + x'y + 1, & x''(0) &= (1)(1) + (-1)(-1) + 1 = 3, \\
y'' &= y + ty' + x', & y''(0) &= -1 + (0)(1) - 1 = -2, \\
x''' &= x'y' + xy'' + x''y + x'y', & x'''(0) &= -7, \\
y''' &= y' + y' + ty'' + x'', & y'''(0) &= 5,
\end{aligned}$$

and so on;

$$x(t) = 1 - t + \frac{3}{2}t^2 - \frac{7}{6}t^3 + \frac{27}{24}t^4 - \frac{124}{120}t^5 + \cdots,$$

$$y(t) = -1 + t - t^2 + \frac{5}{6}t^3 - \frac{13}{24}t^4 + \frac{47}{120}t^5 + \cdots. \tag{5.34}$$

At $t = 0.1$, $x = 0.9139$ and $y = -0.9092$.
Equations (5.34) are the solution to the set (5.33). Note that we need to alternate
between the functions in getting the derivatives; for example, we cannot get $x''(0)$ until
$y'(0)$ is known; we cannot get $y'''(0)$ until $x''(0)$ is known. After we have obtained the
coefficients of the Taylor-series expansions in Eq. (5.34), we can evaluate $x$ and $y$ at any
value of $t$, but the error will depend on how many terms we employ.
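Evaluating the truncated series is a one-line job once the derivative values at $t = 0$ are stored. The sketch below assumes the derivative values implied by the coefficients of Eq. (5.34) through the fifth order; the name `taylor_value` is illustrative.

```python
import math

# Derivative values at t = 0 (zeroth through fifth), as implied by Eq. (5.34)
X_DERIVS = [1.0, -1.0, 3.0, -7.0, 27.0, -124.0]
Y_DERIVS = [-1.0, 1.0, -2.0, 5.0, -13.0, 47.0]

def taylor_value(derivs, t):
    """Sum derivs[k] * t**k / k! over the stored terms."""
    return sum(d * t ** k / math.factorial(k) for k, d in enumerate(derivs))

x01 = taylor_value(X_DERIVS, 0.1)
y01 = taylor_value(Y_DERIVS, 0.1)
print(f"x(0.1) = {x01:.4f}, y(0.1) = {y01:.4f}")

# Dropping the last two terms shows how the answer depends on the number of terms
print(f"three-derivative values: {taylor_value(X_DERIVS[:4], 0.1):.6f}, {taylor_value(Y_DERIVS[:4], 0.1):.6f}")
```

Comparing the full and shortened sums gives a rough feel for the truncation error at a given $t$.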


EULER PREDICTOR-CORRECTOR 


We apply the predictor to each equation; then the corrector can be used. Again note that
we work alternately with the two functions.

Take $h = 0.1$. Let p and c subscripts indicate predicted and corrected values,
respectively:

$$x_p(0.1) = 1 + 0.1[(1)(-1) + 0] = 0.9,$$

$$y_p(0.1) = -1 + 0.1[(0)(-1) + 1] = -0.9,$$

$$x_c(0.1) = 1 + 0.1\left(\frac{-1 + [(0.9)(-0.9) + 0.1]}{2}\right) = 0.9145,$$

$$y_c(0.1) = -1 + 0.1\left(\frac{1 + [(0.1)(-0.9) + 0.9145]}{2}\right) = -0.9088.$$

In computing $x_c(0.1)$, we used $x_p$ and $y_p$. In computing $y_c(0.1)$ after $x_c(0.1)$ is
known, we have a choice between $x_p$ and $x_c$. There is an intuitive feel that one should
use $x_c$, with the idea that one should always use the best available values. This does not
always expedite convergence, probably due to compensating errors. Here we have used
the best values to date. Recorrecting in the obvious manner gives

$$x(0.1) = 0.9135,$$
$$y(0.1) = -0.9089.$$

We can now advance the solution another step if desired, by using the computed values
at $t = 0.1$ as the starting values. From this point we can advance one more step, and so
on for any value of $t$. The errors will be the combination of local truncation error at each
step plus the propagated error resulting from the use of inexact starting values.
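The first predictor-corrector step can be reproduced with the sketch below. It follows the choice made in the text of using the newly corrected $x$ when correcting $y$; the function name `euler_pc_step` is an assumption for illustration.

```python
def f(t, x, y):
    """dx/dt = xy + t."""
    return x * y + t

def g(t, x, y):
    """dy/dt = ty + x."""
    return t * y + x

def euler_pc_step(t, x, y, h):
    """One Euler predictor-corrector step for the pair (5.33), with a single correction."""
    fx0, gy0 = f(t, x, y), g(t, x, y)
    xp = x + h * fx0                                   # predict x
    yp = y + h * gy0                                   # predict y
    xc = x + h * (fx0 + f(t + h, xp, yp)) / 2.0        # correct x with predicted values
    yc = y + h * (gy0 + g(t + h, xc, yp)) / 2.0        # correct y using the corrected x
    return xp, yp, xc, yc

xp, yp, xc, yc = euler_pc_step(0.0, 1.0, -1.0, 0.1)
print(f"predicted: x = {xp:.4f}, y = {yp:.4f}")
print(f"corrected: x = {xc:.4f}, y = {yc:.4f}")
```

Calling the function again with the corrected values as the new starting point advances the solution one more step.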


RUNGE-KUTTA-FEHLBERG 


Again there is an alternation between the x and y calculations. In applying this method,
one always uses the previous k-value in incrementing the function values and the value
of $h$ to increment the independent variable. As in the previous calculations, we oscillate
between computations for x and for y; for example, we do $k_{1,x}$, then $k_{1,y}$, before doing
$k_{2,x}$, and so on.

Keeping in mind that the equations are

$$\frac{dx}{dt} = f(t, x, y) = xy + t, \qquad x(0) = 1,$$

$$\frac{dy}{dt} = g(t, x, y) = ty + x, \qquad y(0) = -1,$$

the k-values for x and y are

for x:

$$\begin{aligned}
k_{1,x} &= hf(0, 1, -1) = 0.1[(1)(-1) + 0] = -0.1; \\
k_{2,x} &= hf(0.025, 0.975, -0.975) = 0.1[(0.975)(-0.975) + 0.025] = -0.092562; \\
k_{3,x} &= hf(0.038, 0.965, -0.964) = 0.1[(0.965)(-0.964) + 0.038] = -0.089226; \\
k_{4,x} &= hf(0.092, 0.919, -0.915) = 0.1[(0.919)(-0.915) + 0.092] = -0.074892; \\
k_{5,x} &= hf(0.1, 0.913, -0.908) = 0.1[(0.913)(-0.908) + 0.1] = -0.072904; \\
k_{6,x} &= hf(0.05, 0.954, -0.953) = 0.1[(0.954)(-0.953) + 0.05] = -0.085868.
\end{aligned}$$

for y:

$$\begin{aligned}
k_{1,y} &= hg(0, 1, -1) = 0.1[(0)(-1) + 1] = 0.1; \\
k_{2,y} &= hg(0.025, 0.975, -0.975) = 0.1[(0.025)(-0.975) + 0.975] = 0.095062; \\
k_{3,y} &= hg(0.038, 0.965, -0.964) = 0.1[(0.038)(-0.964) + 0.965] = 0.092845; \\
k_{4,y} &= hg(0.092, 0.919, -0.915) = 0.1[(0.092)(-0.915) + 0.919] = 0.083461; \\
k_{5,y} &= hg(0.1, 0.913, -0.908) = 0.1[(0.1)(-0.908) + 0.913] = 0.082178; \\
k_{6,y} &= hg(0.05, 0.954, -0.953) = 0.1[(0.05)(-0.953) + 0.954] = 0.090628.
\end{aligned}$$

Then using the fifth-order formula, we get

$$x(0.1) = 1 + (-0.01185 - 0.046307 - 0.037905 + 0.013123 - 0.003122) = 0.913936;$$

$$y(0.1) = -1 + (0.01185 + 0.048185 + 0.042242 - 0.014792 + 0.003296) = -0.909217.$$


Extending the Taylor-series solution even further shows that the Runge—Kutta— 
Fehlberg values are correct to more than five decimals, while the modified Euler values 
are correct to only three, so h = 0.1 may be too large for that method. 

Advancing the solution by the Runge—Kutta—Fehlberg method will again involve 
using the computed values of x and y as the initial values for another step. The errors 
here will be much less than those for the Euler predictor—corrector method. 
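The step worked out above can be sketched for a general system as follows. The coefficients are the standard Runge-Kutta-Fehlberg ones, which reproduce the $t$-values 0.025, 0.038, 0.092, 0.1, and 0.05 used in the k-computations; the names `rkf_step` and `rhs` are illustrative, and only the fifth-order result is returned (the fourth-order value used for error control is left out to keep the sketch short).

```python
def rhs(t, u):
    """Right-hand sides of the pair (5.33); u = [x, y]."""
    x, y = u
    return [x * y + t, t * y + x]

def rkf_step(F, t, u, h):
    """One Runge-Kutta-Fehlberg step for u' = F(t, u); returns the fifth-order estimate."""
    n = len(u)

    def stage(dt, incr):
        # k = h * F(t + dt, u + incr), computed for every component
        return [h * v for v in F(t + dt, [u[i] + incr[i] for i in range(n)])]

    def comb(*terms):
        # weighted sum of k-vectors: sum of c * k for each (c, k) pair
        return [sum(c * k[i] for c, k in terms) for i in range(n)]

    k1 = stage(0.0, [0.0] * n)
    k2 = stage(h / 4, comb((1 / 4, k1)))
    k3 = stage(3 * h / 8, comb((3 / 32, k1), (9 / 32, k2)))
    k4 = stage(12 * h / 13, comb((1932 / 2197, k1), (-7200 / 2197, k2), (7296 / 2197, k3)))
    k5 = stage(h, comb((439 / 216, k1), (-8.0, k2), (3680 / 513, k3), (-845 / 4104, k4)))
    k6 = stage(h / 2, comb((-8 / 27, k1), (2.0, k2), (-3544 / 2565, k3),
                           (1859 / 4104, k4), (-11 / 40, k5)))
    step = comb((16 / 135, k1), (6656 / 12825, k3), (28561 / 56430, k4),
                (-9 / 50, k5), (2 / 55, k6))
    return [u[i] + step[i] for i in range(n)]

x1, y1 = rkf_step(rhs, 0.0, [1.0, -1.0], 0.1)
print(f"x(0.1) = {x1:.6f}, y(0.1) = {y1:.6f}")
```

Because the stages are computed for all components together, the alternation between x and y described in the text happens inside `stage`.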


ADAMS-MOULTON


After getting four starting values, we proceed with the algorithm of Eq. (5.22), again
alternately computing x and then y (see Table 5.12).
In the computations we first get predicted values of x and y:

$$x(0.1) = 0.9330 + \frac{0.025}{24}\left[55(-0.7929) - 59(-0.8582) + 37(-0.9271) - 9(-1.0)\right] = 0.913937;$$

$$y(0.1) = -0.9303 + \frac{0.025}{24}\left[55(0.8632) - 59(0.9060) + 37(0.9515) - 9(1.0)\right] = -0.909217.$$

After getting $x'$ and $y'$ at $t = 0.1$, using the predicted $x(0.1)$ and $y(0.1)$, we then correct:

$$x(0.1) = 0.9330 + \frac{0.025}{24}\left[9(-0.7310) + 19(-0.7929) - 5(-0.8582) + (-0.9271)\right] = 0.913936;$$

$$y(0.1) = -0.9303 + \frac{0.025}{24}\left[9(0.8230) + 19(0.8632) - 5(0.9060) + (0.9515)\right] = -0.909217.$$

The close agreement of predicted and corrected values indicates six-decimal-place
accuracy.

In this method, as we advance the solution to larger values of $t$, the comparison
between predictor and corrector values tells us whether the step size needs to be changed.


MILNE


This method can be applied to a system of first-order equations exactly analogously to
the application of Adams-Moulton. We do not illustrate it because it is subject to
instability.


Table 5.12

             t        x          x'           t        y          y'
             0.000    1.0        -1.0         0.000    -1.0       1.0
Starting     0.025    0.9759     -0.9271      0.025    -0.9756    0.9515
values       0.050    0.9536     -0.8582      0.050    -0.9524    0.9060
             0.075    0.9330     -0.7929      0.075    -0.9303    0.8632
Predicted    0.100    (0.9139)   (-0.7310)    0.100    (-0.9092)  (0.8230)
Corrected             0.9139                           -0.9092
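The Adams-Moulton step recorded in Table 5.12 can be sketched as below. The starting values are the four-decimal entries of the table, so the results agree with the text only to about four decimals; the helper names `predict` and `correct` are illustrative.

```python
h = 0.025
t_new = 0.1

# Starting values from Table 5.12 at t = 0, 0.025, 0.050, 0.075
xs, dxs = [1.0, 0.9759, 0.9536, 0.9330], [-1.0, -0.9271, -0.8582, -0.7929]
ys, dys = [-1.0, -0.9756, -0.9524, -0.9303], [1.0, 0.9515, 0.9060, 0.8632]

def predict(u, du):
    """Adams predictor: u_n + h/24 (55u'_n - 59u'_{n-1} + 37u'_{n-2} - 9u'_{n-3})."""
    return u[3] + h / 24.0 * (55.0 * du[3] - 59.0 * du[2] + 37.0 * du[1] - 9.0 * du[0])

def correct(u, du, du_new):
    """Adams-Moulton corrector: u_n + h/24 (9u'_{n+1} + 19u'_n - 5u'_{n-1} + u'_{n-2})."""
    return u[3] + h / 24.0 * (9.0 * du_new + 19.0 * du[3] - 5.0 * du[2] + du[1])

# Predict both functions first, then evaluate both derivatives at the predicted point
xp, yp = predict(xs, dxs), predict(ys, dys)
dxp = xp * yp + t_new          # x' = xy + t
dyp = t_new * yp + xp          # y' = ty + x
xc, yc = correct(xs, dxs, dxp), correct(ys, dys, dyp)

print(f"predicted: x = {xp:.4f}, y = {yp:.4f}, x' = {dxp:.4f}, y' = {dyp:.4f}")
print(f"corrected: x = {xc:.4f}, y = {yc:.4f}")
```

The difference between the predicted and corrected values is the quantity to watch when deciding whether the step size should be changed.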
5.11 COMPARISON OF METHODS


It is appropriate that we summarize the various methods that have been discussed in this 
chapter and compare them. Table 5.13 compares the accuracy, effort required, stability, 
and other features of the methods. 

The data in Table 5.13 lead us to draw the usual conclusion about the best scheme 
for solving a differential equation of higher order, or a system of N first-order equations: 
We begin with a fourth-order Runge—Kutta to get a total of four values for each of the 
functions (this also allows us to compute four values for each of the derivatives), and 
then advance the solution with Adams—Moulton. At each step after employing Adams— 
Moulton, we check* the accuracy and adjust the step size when appropriate. 

There is still a problem during the starting phase when Runge-Kutta is being used.
For instance, when a Runge—Kutta—Fehlberg method (Section 5.4) is used to start the 
solution for a multistep method that needs equispaced function values, an additional 
restriction is imposed; the step size must be uniform, which may mean that closer spaced 
values may need to be computed than are required to meet the accuracy criterion alone. 

A method due to Hamming has been widely accepted. It is available through subroutines
in some FORTRAN libraries. It begins the solution with a fourth-order Runge-Kutta,
and then continues with a predictor-corrector. The equations employed are

$$\begin{aligned}
y_{i+1,p} &= y_{i-3} + \frac{4h}{3}\left(2f_i - f_{i-1} + 2f_{i-2}\right), \\
y_{i+1,m} &= y_{i+1,p} - \frac{112}{121}\left(y_{i,p} - y_{i,c}\right), \\
y_{i+1,c} &= \frac{1}{8}\left[9y_i - y_{i-2} + 3h\left(f_{i+1,m} + 2f_i - f_{i-1}\right)\right], \\
y_{i+1} &= y_{i+1,c} + \frac{9}{121}\left(y_{i+1,p} - y_{i+1,c}\right).
\end{aligned} \tag{5.35}$$

The predictor equation is the Milne predictor. Before correcting, the estimate of $y_{i+1}$
is modified using the difference between the predicted and corrected values in the previous
interval (this is omitted in the first interval, since these are not available). A corrector
formula is used that depends on two previous y-values, though heavily weighted in favor
of the last one. Finally, an adjustment is made based on the error estimate computed from
the difference between predicted and corrected values. Note that, while two additional
equations are employed in each step compared to the predictor-corrector methods previously
described, only two evaluations of the derivative function are needed, the same
as before. For many applications, it is the evaluation of the derivative function that is
costly in computer time; the two extra algebraic steps don't count for much.
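The four equations of (5.35) translate directly into a loop. The sketch below is an illustration under stated assumptions rather than a library routine: the name `hamming` is hypothetical, the four starting values are supplied by the caller (exact values are used in the example instead of a Runge-Kutta start), and the test equation $y' = -y$ is not from the text.

```python
import math

def hamming(f, x0, start, h, n_steps):
    """Advance y' = f(x, y) to x0 + n_steps*h from four equally spaced starting values."""
    ys = list(start)                                   # y_0 .. y_3
    fs = [f(x0 + i * h, y) for i, y in enumerate(ys)]
    p_minus_c = 0.0                                    # modifier omitted in the first interval
    for i in range(3, n_steps):
        x_next = x0 + (i + 1) * h
        p = ys[i - 3] + 4.0 * h / 3.0 * (2.0 * fs[i] - fs[i - 1] + 2.0 * fs[i - 2])
        m = p - 112.0 / 121.0 * p_minus_c              # modified predictor
        c = (9.0 * ys[i] - ys[i - 2]
             + 3.0 * h * (f(x_next, m) + 2.0 * fs[i] - fs[i - 1])) / 8.0
        y_new = c + 9.0 / 121.0 * (p - c)              # final adjustment
        p_minus_c = p - c                              # saved for the next interval
        ys.append(y_new)
        fs.append(f(x_next, y_new))                    # second derivative evaluation
    return ys

h = 0.1
start = [math.exp(-i * h) for i in range(4)]           # exact starting values for y' = -y
ys = hamming(lambda x, y: -y, 0.0, start, h, 10)
print(f"y(1.0) = {ys[-1]:.7f}   exact = {math.exp(-1.0):.7f}")
```

Each pass through the loop calls the derivative function twice, once with the modified predictor and once with the final value, in line with the count given in the text.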


*Ordinarily a weighted average of the N errors is monitored. Alternatively, the maximum error is controlled. 
Table 5.13  Comparison of methods for differential equations

                                          Local     Global    Function                        Ease of changing
Method                     Type           error     error     evaluations/step   Stability   step size          Recommended?

Modified Euler             Single-step    O(h^3)    O(h^2)    2                  Good        Good               No
Fourth-order Runge-Kutta   Single-step    O(h^5)    O(h^4)    4                  Good        Good               Yes
Runge-Kutta-Fehlberg       Single-step    O(h^6)    O(h^5)    6                  Good        Good               Yes
Milne                      Multistep      O(h^5)    O(h^4)    2                  Poor        Poor               No
Adams-Moulton              Multistep      O(h^5)    O(h^4)    2                  Good        Poor               Yes


The special merits of Hamming's method are stability combined with good accuracy.
Like Milne's and Adams', the method has a local error of $O(h^5)$ and a global error of
$O(h^4)$.

Gear (1967) has proposed a predictor-corrector method that has an $O(h^6)$ local error
but uses only three previous steps rather than the four previous steps employed by
Adams-Moulton and Milne. It obtains its high order of error by using recorrected values of the
function and derivative values. The formulas are

$$\begin{aligned}
y_{n+1,p} &= -18y_n + 9y_{n-1,c_2} + 10y_{n-2,c_2} + 9hy'_n + 18hy'_{n-1} + 3hy'_{n-2}, \\
hy'_{n+1,p} &= -57y_n + 24y_{n-1,c_2} + 33y_{n-2,c_2} + 24hy'_n + 57hy'_{n-1} + 10hy'_{n-2}, \\
F &= hy'_{n+1,p} - hf(x_{n+1}, y_{n+1,p}), \\
y_{n+1} &= y_{n+1,p} - \frac{95}{288}F, \\
y_{n,c_1} &= y_n + \frac{3}{160}F, \\
y_{n-1,c_2} &= y_{n-1,c_1} - \frac{11}{1440}F, \\
hy'_{n+1} &= hy'_{n+1,p} - F.
\end{aligned}$$


This method is stable and is applicable to systems of first-order differential equations. In 
addition, Gear (1971) has a listing and a complete description of a subroutine called 
DIFSUB, which includes both the Adams predictor—corrector method and Gear’s stiff 
methods. This subroutine is also the basis of an IMSL subroutine DGEAR, which contains 
Adams methods up to order 12 and methods for stiff differential equations. 

A stiff equation results from phenomena with widely differing time scales. For example,
the general solution of a differential equation may involve sums or differences of
terms of the form $ae^{ct}$, $be^{dt}$, where both $c$ and $d$ are negative but $c$ is much smaller than
$d$. In such cases, using a small value for the step size can introduce enough round-off
errors to cause instability.

An example is the following:

$$\begin{aligned}
x' &= 1195x - 1995y, & x(0) &= 2, \\
y' &= 1197x - 1997y, & y(0) &= -2.
\end{aligned} \tag{5.36}$$

The analytical solution of Eq. (5.36) is

$$x(t) = 10e^{-2t} - 8e^{-800t}, \qquad y(t) = 6e^{-2t} - 8e^{-800t}.$$

Observe that the exponents are all negative and of very different magnitude, qualifying
this as a stiff equation. Suppose we solve Eq. (5.36) by the simple Euler method with
$h = 0.1$, applying just one step. The iterations are

$$x_{i+1} = x_i + hf(x_i, y_i) = x_i + 0.1(1195x_i - 1995y_i),$$
$$y_{i+1} = y_i + hg(x_i, y_i) = y_i + 0.1(1197x_i - 1997y_i).$$

This gives $x(0.1) = 640$, $y(0.1) = 636$, while the exact values are $x(0.1) = 8.187$ and
$y(0.1) = 4.912$. Such a result is typical (though here exaggerated) for stiff equations.
One solution to this problem is to use an implicit method rather than an explicit one.
All the methods so far discussed have been explicit, meaning that the new values, $x_{i+1}$
and $y_{i+1}$, are computed in terms of the previous ones, $x_i$ and $y_i$. The implicit form of
Euler's method is

$$x_{i+1} = x_i + hf(x_{i+1}, y_{i+1}),$$
$$y_{i+1} = y_i + hg(x_{i+1}, y_{i+1}). \tag{5.37}$$

If the derivative functions $f(x, y)$ and $g(x, y)$ are nonlinear, this is difficult to solve.
However, in Eq. (5.36), they are linear. Solving Eq. (5.36) by use of Eq. (5.37), we
have

$$x_{i+1} = x_i + 0.1(1195x_{i+1} - 1995y_{i+1}),$$
$$y_{i+1} = y_i + 0.1(1197x_{i+1} - 1997y_{i+1}).$$

Since the system is linear, we can write

$$\begin{bmatrix} x_{i+1} \\ y_{i+1} \end{bmatrix} = \begin{bmatrix} 1 - 1195(0.1) & 1995(0.1) \\ -1197(0.1) & 1 + 1997(0.1) \end{bmatrix}^{-1} \begin{bmatrix} x_i \\ y_i \end{bmatrix},$$

which has the solution $x(0.1) = 8.23$, $y(0.1) = 4.90$, reasonably close to the analytical
values.

In summary, our results for the solution of Eq. (5.36) are

                    x(0.1)    y(0.1)
Exact               8.19      4.91
Euler
  Explicit          640       636
  Implicit          8.23      4.90
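Both single steps can be checked with the sketch below. It is specific to the linear system (5.36): the implicit step solves the 2 x 2 linear system by Cramer's rule, and the function names are illustrative.

```python
import math

A = ((1195.0, -1995.0),
     (1197.0, -1997.0))

def explicit_euler_step(x, y, h):
    """z_{i+1} = z_i + h A z_i."""
    return (x + h * (A[0][0] * x + A[0][1] * y),
            y + h * (A[1][0] * x + A[1][1] * y))

def implicit_euler_step(x, y, h):
    """Solve (I - hA) z_{i+1} = z_i by Cramer's rule."""
    a, b = 1.0 - h * A[0][0], -h * A[0][1]
    c, d = -h * A[1][0], 1.0 - h * A[1][1]
    det = a * d - b * c
    return (d * x - b * y) / det, (a * y - c * x) / det

def exact(t):
    """Analytical solution of Eq. (5.36)."""
    return (10.0 * math.exp(-2.0 * t) - 8.0 * math.exp(-800.0 * t),
            6.0 * math.exp(-2.0 * t) - 8.0 * math.exp(-800.0 * t))

for name, value in (("exact", exact(0.1)),
                    ("explicit", explicit_euler_step(2.0, -2.0, 0.1)),
                    ("implicit", implicit_euler_step(2.0, -2.0, 0.1))):
    print(f"{name:9s} x(0.1) = {value[0]:9.3f}   y(0.1) = {value[1]:9.3f}")
```

For a nonlinear system the implicit step would require an iterative solver at each step in place of the direct 2 x 2 solution used here.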
If the step size is very small, we can get good results from the simpler explicit Euler method after
the first step. With h = 0.0001, the table of results becomes

                      x(0.0001)    y(0.0001)
Exact                   2.61         -1.39
Euler
   Explicit             2.64         -1.36
   Implicit             2.60         -1.41

but this would require 1000 steps to reach t = 0.1, and round-off errors would be large.

If we anticipate some material from the next chapter, we can give a better description 
of stiffness as well as indicate the derivation of the general solution to Eq. (5.36). We 
rewrite Eq. (5.36) in matrix form: 


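$$\begin{bmatrix} x' \\ y' \end{bmatrix} = A \begin{bmatrix} x \\ y \end{bmatrix}, \qquad \text{where } A = \begin{bmatrix} 1195 & -1995 \\ 1197 & -1997 \end{bmatrix}.$$

The general solution, in matrix form, is

$$\begin{bmatrix} x \\ y \end{bmatrix} = ae^{-2t}v_1 + ce^{-800t}v_2,$$

where

$$v_1 = \begin{bmatrix} 5 \\ 3 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}.$$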


You can easily verify that $Av_1 = -2v_1$ and $Av_2 = -800v_2$. In Chapter 6 we will see that
this means that $v_1$ is an eigenvector of A and that $-2$ is the corresponding eigenvalue.
Similarly, $v_2$ is an eigenvector of A with the corresponding eigenvalue of $-800$. (In
Chapter 6 you will learn how to find the eigenvectors and eigenvalues of a matrix.) 
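The two relations just quoted are easy to check numerically. The Python sketch below is an illustration under stated assumptions: the helper names are hypothetical, the eigenvalues are found from the characteristic quadratic of the 2 x 2 matrix rather than by the methods of Chapter 6, and the vectors (5, 3) and (1, 1) are one choice of eigenvectors that satisfies the relations. The ratio of the two eigenvalue magnitudes, sometimes called the stiffness ratio (a term not used in the text), is also computed.

```python
import math

# Coefficient matrix of Eq. (5.36).
A = [[1195.0, -1995.0],
     [1197.0, -1997.0]]


def eigenvalues_2x2(m):
    """Real eigenvalues of a 2 x 2 matrix from lambda^2 - tr*lambda + det = 0.

    Assumes the discriminant is non-negative (true for this matrix).
    """
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = math.sqrt(tr * tr - 4.0 * det)
    return (tr + root) / 2.0, (tr - root) / 2.0


def matvec(m, v):
    """Product of a 2 x 2 matrix with a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


lam_slow, lam_fast = eigenvalues_2x2(A)
stiffness_ratio = abs(lam_fast) / abs(lam_slow)

if __name__ == "__main__":
    print(lam_slow, lam_fast, stiffness_ratio)
    print(matvec(A, [5.0, 3.0]))   # should equal -2 times (5, 3)
    print(matvec(A, [1.0, 1.0]))   # should equal -800 times (1, 1)
```

The widely separated eigenvalues are what force a very small step on an explicit method even after the fast component has decayed.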

A stiff equation can be defined in terms of the eigenvalues of the matrix A that 
represents the right-hand sides of the system of differential equations. When the eigen- 
values of A have real parts that are negative and differ widely in magnitude as in this 
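example, the system is stiff. In the case of a nonlinear system

$$\mathbf{x}' = \mathbf{f}(\mathbf{x}),$$

where

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \qquad \mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(x_1, x_2, \ldots, x_n) \\ f_2(x_1, x_2, \ldots, x_n) \\ \vdots \\ f_n(x_1, x_2, \ldots, x_n) \end{bmatrix},$$

one must consider the Jacobian matrix whose terms are $\partial f_i/\partial x_j$. See Gear (1971) for more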
information. 
5.12 CHAPTER SUMMARY


These are things you should be able to do if you understand the material in Chapter 5: 


1. Use the Taylor-series procedure to solve a first- or second-order differential equation. 
You should be able to explain why this method is not generally used. 

2. Solve first-order equations and systems of first-order equations with all of these
   methods:

a. Simple Euler 

b. Modified Euler 

c. Fourth-order Runge-Kutta 

d. Runge—Kutta—Fehlberg 

e. Adams—Moulton (applied after starting the solution with another method) 

3. Rewrite a higher-order differential equation or system of higher-order equations as
   an equivalent system of first-order equations.

4. Explain these terms: efficiency, accuracy, stability, and convergence as applied to
the procedures of this chapter. In particular, you should be able to explain why 
Milne’s method is rejected. 

5. Compare all of these methods for accuracy, efficiency, stability, ability to monitor 
errors, and ease of changing step size. 

6. Outline the arguments used to show whether errors grow as they propagate and
whether a method is stable. 

7. Explain what is meant by the term stiff equation and give examples of explicit and 
implicit methods. 

8. Utilize computer programs to obtain the solution to a given differential equation or 
system of equations. 


SELECTED READINGS FOR CHAPTER 5 
Gear (1971); Press, Flannery, Teukolsky, and Vetterling (1986); Stoer and Bulirsch (1980). 


5.13 COMPUTER PROGRAMS


Program 1 (Fig. 5.3) solves a second-order differential equation by the modified Euler
method, after reducing it to a pair of first-order equations. The computer output is the
solution to the equation that represents a damped spring system,

$$\frac{w}{g}\frac{d^2x}{dt^2} + b\frac{dx}{dt} + kx = A\sin\omega t.$$

In the program, each of the constants in the differential equation has the same name
except that k is called XK and ω is called C. In the program, X0 and T0 are the initial
values of displacement and time. The variable y is used for dx/dt, and Y0 is its initial




      PROGRAM PEULER (INPUT,OUTPUT)
C
C  PROGRAM FOR EULER PREDICTOR-CORRECTOR SOLUTION TO A SECOND
C  ORDER DIFFERENTIAL EQUATION OR TWO SIMULTANEOUS FIRST ORDER
C  EQUATIONS.
C
      REAL X,Y,T,A,B,C,XK,W,X0,Y0,T0,TEND,DT,TOL,XP,YP,DX0,DY0
     +     ,Z,XC,YC,TOUT
      INTEGER I
C
C  DEFINE THE DERIVATIVE FUNCTIONS. FOR A SECOND ORDER EQUATION,
C  ALWAYS USE FCN1 = VARIABLE THAT REPLACES THE FIRST DERIVATIVE.
C
      FCN1(X,Y,T) = Y
      FCN2(X,Y,T) = ( A*SIN(C*T) - B*Y - XK*X ) * 32.1725 / W
C
C  READ THE CONSTANTS OF THE PROBLEM
C
C     READ*, W,B,XK,A,C
C     READ*, X0,Y0,T0,TEND,DT,TOL
      DATA W,B,XK,A,C/10,5,6,20,2/
      DATA X0,Y0,T0,TEND,DT,TOL/1,-2,0,2,0.05,0.01/
C
      PRINT 200
      PRINT 201, T0,X0,Y0
C
C  COMPUTE BY PREDICTOR FORMULAS
C
    5 IF ( T0 .LT. TEND ) THEN
        DX0 = FCN1(X0,Y0,T0)
        XP = X0 + DT*DX0
        DY0 = FCN2(X0,Y0,T0)
        YP = Y0 + DT*DY0
C
C  NOW CORRECT. REPEAT CORRECTIONS UP TO FIVE TIMES.
C
        DO 10 I = 1,5
          XC = X0 + DT*( DX0 + FCN1(XP,YP,T0+DT) ) / 2.0
          YC = Y0 + DT*( DY0 + FCN2(XC,YP,T0+DT) ) / 2.0
          Z = ( ABS(XC - XP) + ABS(YC - YP) ) / 2.0
          IF ( Z .LE. TOL ) GO TO 20
          XP = XC
          YP = YC
   10   CONTINUE
        TOUT = T0 + DT
        PRINT 202, TOUT
C
C  RESET VARIABLES. PRINT A NEW LINE AND GO ON TO NEXT STEP.
C
   20   T0 = T0 + DT
        PRINT 201, T0,XC,YC
        X0 = XC
        Y0 = YC
        GO TO 5
      END IF
C
  200 FORMAT(///2X,'TIME',7X,'DISTANCE',15X,'VELOCITY'/)
  201 FORMAT(1X,F5.2,6X,F10.7,12X,F13.7)
  202 FORMAT(1X,'AT T = ',F5.2,' TOLERANCE NOT MET WITH FIVE',
     +       ' RECORRECTIONS ')
      STOP
      END

Figure 5.3 Program 1.


OUTPUT FOR PROGRAM 1

  TIME       DISTANCE               VELOCITY

   .00      1.0000000             -2.0000000
   .05       .9152416             -1.4002323
   .10       .8583095              -.8601237
   .15       .8270768              -.3760873
   .20       .8189385               .0622915
   .25       .8316736               .4585103
   .30       .8631817               .8132604
   .35       .9113776              1.1261918
   .40       .9741545              1.3966751
   .45      1.0493767              1.6241366
   .50      1.1348858              1.8082081
   .55      1.2285121              1.9487964
   .60      1.3280888              2.0461160
   .65      1.4314684              2.1007043
   .70      1.5365380              2.1134275
   .75      1.6412367              2.0854794
   .80      1.7435708              2.0183761
   .85      1.8416304              1.9139465
   .90      1.9336045              1.7743193
   .95      2.0177954              1.6019060
  1.00      2.0926322              1.3993814
  1.05      2.1566839              1.1696611
  1.10      2.2086704               .9158759
  1.15      2.2477825               .6362084
  1.20      2.2725406               .3426194
  1.25      2.2822756               .0372065
  1.30      2.2764537              -.2771621
  1.35      2.2547018              -.5971912
  1.40      2.2168206              -.9193613
  1.45      2.1627919             -1.2400556
  1.50      2.0927802             -1.5556447
  1.55      2.0071316             -1.8625485
  1.60      1.9063695             -2.1572853
  1.65      1.7915245             -2.4420739
  1.70      1.6629711             -2.7060522
  1.75      1.5217995             -2.9474093
  1.80      1.3691976             -3.1639028
  1.85      1.2064641             -3.3532707
  1.90      1.0350061             -3.5134185
  1.95       .8563286             -3.6425097
  2.00       .6720204             -3.7390133
value. The solution is computed from T0 to TEND, using a step size equal to DT. TOL
is the tolerance used to decide whether recorrections should continue; they
terminate when the change on recorrection does not exceed TOL. The parameters and initial
conditions are read in. If the tolerance is not met in five
recorrections, the value is printed together with a message to indicate that the tolerance
was not met.
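For readers who want to experiment without a FORTRAN compiler, the same predictor-corrector logic can be sketched in Python. This is an illustration, not a translation endorsed by the text: the function and argument names are hypothetical, double precision replaces the single precision of the listing, and the time for each step is computed as t0 + i*dt instead of being accumulated. The corrector averages the slopes at the two ends of the interval and, as in Program 1, uses the freshly corrected x when evaluating the second derivative function.

```python
import math

G = 32.1725  # gravitational constant used in the listing, ft/s^2


def modified_euler_step(x0, y0, t0, dt, w, b, k, a, c, tol=0.01, max_corr=5):
    """One predictor-corrector step for (w/g) x'' + b x' + k x = a sin(c t).

    y stands for dx/dt. Returns (x, y, converged).
    """
    def f1(x, y, t):
        return y

    def f2(x, y, t):
        return (a * math.sin(c * t) - b * y - k * x) * G / w

    dx0 = f1(x0, y0, t0)
    dy0 = f2(x0, y0, t0)
    xp = x0 + dt * dx0          # Euler predictor
    yp = y0 + dt * dy0
    xc, yc = xp, yp
    for _ in range(max_corr):
        # Corrector: average the slopes at both ends of the interval.
        xc = x0 + dt * (dx0 + f1(xp, yp, t0 + dt)) / 2.0
        yc = y0 + dt * (dy0 + f2(xc, yp, t0 + dt)) / 2.0
        if (abs(xc - xp) + abs(yc - yp)) / 2.0 <= tol:
            return xc, yc, True
        xp, yp = xc, yc
    return xc, yc, False


def solve(x0, y0, t0, t_end, dt, **params):
    """Integrate from t0 to t_end; returns a list of (t, x, y) rows."""
    rows = [(t0, x0, y0)]
    n_steps = int(round((t_end - t0) / dt))
    for i in range(n_steps):
        x0, y0, _ = modified_euler_step(x0, y0, t0 + i * dt, dt, **params)
        rows.append((t0 + (i + 1) * dt, x0, y0))
    return rows


if __name__ == "__main__":
    for t, x, y in solve(1.0, -2.0, 0.0, 2.0, 0.05,
                         w=10.0, b=5.0, k=6.0, a=20.0, c=2.0):
        print(f"{t:5.2f} {x:12.7f} {y:12.7f}")
```

With the constants of the DATA statements, the first step should agree with the second line of the printed output to about six figures; later lines may differ slightly because of the change in arithmetic precision.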

Program 2 (Fig. 5.4) uses a subroutine, RK4TH, to solve a single first-order
differential equation of the form dx/dt = f(x, t), using the Runge-Kutta method. Parameters
are passed to specify the initial value of the function, X0, and the initial value and the
increment for the independent variable, T0 and H. The subroutine returns X, the value
of the function at the end of the interval. An external function calculates values of the
derivative function.
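The interface just described, a routine that takes the derivative function, the starting point, and the step size and returns the new value, can be sketched in Python as follows. The names rk4_step and integrate are hypothetical; the argument order of the derivative function, f(x, t), follows the convention of the listing.

```python
def rk4_step(fcn, t0, h, x0):
    """Advance dx/dt = fcn(x, t) from (t0, x0) to t0 + h with classical RK4."""
    k1 = h * fcn(x0, t0)
    k2 = h * fcn(x0 + k1 / 2.0, t0 + h / 2.0)
    k3 = h * fcn(x0 + k2 / 2.0, t0 + h / 2.0)
    k4 = h * fcn(x0 + k3, t0 + h)
    return x0 + (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def integrate(fcn, t0, x0, h, n_steps):
    """Call rk4_step repeatedly, as the main program of the listing does."""
    t, x = t0, x0
    for _ in range(n_steps):
        x = rk4_step(fcn, t, h, x)
        t += h
    return t, x


if __name__ == "__main__":
    # Test equation dx/dt = x, x(0) = 1, whose solution is exp(t).
    print(integrate(lambda x, t: x, 0.0, 1.0, 0.1, 10))
```

For the test equation dx/dt = x, one step of size h reproduces the Taylor series of the exponential through the term in h to the fourth power, which is a convenient check on the coefficients.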

The ecologist's problem concerning the growth in population of field mice, described
at the beginning of this chapter, is solved in Program 2 using Δt = 0.2. In this problem
one of the constants in the differential equation, B, is known as a function of time only 
through a table of values. To find the proper values of this constant, a subroutine that 
does interpolation is called. (This subroutine, INTERP, is the same one presented earlier, 
in Chapter 3.) The values of the two constants needed by the function are passed through 
COMMON. (They cannot be passed as parameters because the subroutine RK4TH was 
written with a call to the function that did not include them as parameters.) 
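The table lookup performed for the constant B can be illustrated with a short Python sketch of forward-difference interpolation in a uniformly spaced table. This is a sketch under stated assumptions, not the INTERP routine itself: the function name and argument list are hypothetical, indices are 0-based, and the rule for centering the point of interest is a simplified version of the one in the listing.

```python
def newton_forward(x1, h, y, x, m):
    """Interpolate in a uniformly spaced table with a degree-m Newton-Gregory
    forward-difference polynomial.

    x1 is the first abscissa, h the spacing, y the list of table values,
    x the point of interest. The starting index is chosen so that x sits
    near the middle of the m + 1 points used (clamped to the table).
    """
    n = len(y)
    j = int((x - x1) / h - (m + 1) / 2.0 + 1.0)   # 0-based starting index
    j = max(0, min(j, n - m - 1))
    # Build the forward differences of orders 1..m at y[j].
    col = list(y[j:j + m + 1])
    diffs = []
    for _ in range(m):
        col = [col[i + 1] - col[i] for i in range(len(col) - 1)]
        diffs.append(col[0])
    s = (x - (x1 + j * h)) / h
    result = y[j]
    num, den = s, 1.0
    for i, d in enumerate(diffs, start=1):
        result += num / den * d      # adds s(s-1)...(s-i+1)/i! times the i-th difference
        num *= s - i
        den *= i + 1
    return result


if __name__ == "__main__":
    squares = [0.0, 1.0, 4.0, 9.0, 16.0]
    print(newton_forward(0.0, 1.0, squares, 1.5, 2))
```

A polynomial of degree m is reproduced exactly by an interpolating polynomial of the same degree, so tables of squares or cubes make simple test cases.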

The output from Program 2 shows how the mouse population is expected to vary. 
Of course, only whole numbers of mice are possible; our mathematical model considers 
population as a continuous variable. We are not surprised to see that the population 
reaches a maximum as the food supply diminishes with time. 

The third program is PRKSYST (Fig. 5.5), which solves a system of N simultaneous 
first-order equations by the Runge—Kutta—Fehlberg method. Parameters passed to the 
subroutine include the name of a function that computes each of the N derivative values, 
the initial value and the step size for the independent variable, an array of function values 
at the beginning of the interval, and the number of equations. Values of the functions at 
the end of the interval are returned in an array. A doubly dimensioned array (declared of 
size 6 x N in the main program) is employed to hold intermediate values. 

This routine can be used to solve an nth-order differential equation, or a system of 
higher-order equations, by reducing each of them to a set of simultaneous first-order 
equations. 
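A single Runge-Kutta-Fehlberg step for a system can be sketched compactly in Python. The sketch below is illustrative and makes several assumptions: the names rkf45_step and derivs are hypothetical, lists stand in for the FORTRAN arrays, and the step-size adjustment of the subroutine is omitted so that only the six stages, the fifth-order result, and the error estimate are shown. The test system is the one solved by Program 3.

```python
def rkf45_step(derivs, t, h, x):
    """One Runge-Kutta-Fehlberg step for the system x' = derivs(x, t).

    x is a list of function values at t. Returns (x_new, err), where x_new is
    the fifth-order estimate at t + h and err is the largest componentwise
    difference between the fourth- and fifth-order estimates.
    """
    n = len(x)

    def shifted(*terms):
        # x + sum(coeff * k) over the given (coeff, k) pairs
        return [x[i] + sum(c * k[i] for c, k in terms) for i in range(n)]

    def stage(tt, xx):
        return [h * f for f in derivs(xx, tt)]

    k1 = stage(t, x)
    k2 = stage(t + h / 4, shifted((1 / 4, k1)))
    k3 = stage(t + 3 * h / 8, shifted((3 / 32, k1), (9 / 32, k2)))
    k4 = stage(t + 12 * h / 13,
               shifted((1932 / 2197, k1), (-7200 / 2197, k2), (7296 / 2197, k3)))
    k5 = stage(t + h,
               shifted((439 / 216, k1), (-8.0, k2), (3680 / 513, k3),
                       (-845 / 4104, k4)))
    k6 = stage(t + h / 2,
               shifted((-8 / 27, k1), (2.0, k2), (-3544 / 2565, k3),
                       (1859 / 4104, k4), (-11 / 40, k5)))
    x_new = shifted((16 / 135, k1), (6656 / 12825, k3), (28561 / 56430, k4),
                    (-9 / 50, k5), (2 / 55, k6))
    err = max(abs(k1[i] / 360 - 128 * k3[i] / 4275 - 2197 * k4[i] / 75240
                  + k5[i] / 50 + 2 * k6[i] / 55) for i in range(n))
    return x_new, err


def derivs(x, t):
    """The test system of the listing: x' = y, y' = x + t."""
    return [x[1], x[0] + t]


if __name__ == "__main__":
    print(rkf45_step(derivs, 0.0, 0.1, [1.0, -1.0]))
```

Starting from x = 1, y = -1 at t = 0 with h = 0.1, the result can be compared with the exact values cosh(0.1) - 0.1 and sinh(0.1) - 1. A full solver would halve or double h according to the returned error estimate, as the subroutine does.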


      PROGRAM PRK4TH (INPUT,OUTPUT)
C
C  MAIN PROGRAM TO SOLVE THE FIELD MICE PROBLEM. IT EMPLOYS TWO
C  SUBROUTINES. ROUTINE RK4TH SOLVES THE DIFFERENTIAL EQUATION AND
C  ROUTINE INTERP INTERPOLATES IN A TABLE OF VALUES FOR ONE OF THE
C  PARAMETERS. THE PARAMETERS OF THE DIFFERENTIAL EQUATION ARE
C  PASSED TO A FUNCTION SUBPROGRAM THROUGH COMMON. THIS FUNCTION
C  IS AN EXTERNAL SUBPROGRAM BECAUSE IT IS PASSED AS A PARAMETER.
C
      REAL A,BOUT,B(9),D(10),FCT,T0,X0,H,TF,X
      INTEGER I
      EXTERNAL FCT
      COMMON /PARAM/A,BOUT
      A = 0.9
C
C  READ IN THE VALUES OF THE CONSTANT B, WHICH ARE DEFINED ONLY AS
C  A TABLE OF VALUES FOR VARIOUS T VALUES.
C
      READ*, ( B(I), I = 1,9 )
      READ*, T0,X0,H,TF
C
C  WRITE HEADING AND INITIAL VALUES
C
      PRINT 200
      PRINT 201, T0,X0
C
C  CALL THE SUBROUTINE RK4TH FOR THE NEXT VALUE AND PRINT IT, BUT
C  MUST GET PROPERLY INTERPOLATED B VALUE FROM INTERP.
C
    5 CALL INTERP(0.0,8.0,1.0,9,B,T0,3,BOUT)
      CALL RK4TH (FCT,T0,H,X0,X)
C
C  ADVANCE VARIABLES
C
      T0 = T0 + H
      X0 = X
      PRINT 201, T0,X
C
C  ARE WE DONE ? TEST T VALUE TO SEE.
C
      IF ( T0 .LT. TF ) GO TO 5
C
  200 FORMAT(///' SOLUTION TO A DIFFERENTIAL EQUATION',/
     +       '         BY RK4TH METHOD',//)
  201 FORMAT(1X,F6.1,1X,F14.1)
      STOP
      END
C
      REAL FUNCTION FCT(XN,T)
C
C  THIS FUNCTION COMPUTES VALUES FOR THE DIFFERENTIAL EQUATION.
C
      REAL A,BOUT,XN,T
      COMMON /PARAM/A,BOUT
      FCT = A*XN - BOUT*XN**1.7
      RETURN
      END

Figure 5.4 Program 2.
      SUBROUTINE RK4TH (FCN,T0,H,X0,X)
C
C  SUBROUTINE RK4TH :
C     THIS SUBROUTINE ADVANCES THE SOLUTION OF A
C  FIRST ORDER DIFFERENTIAL EQUATION OF THE FORM DX/DT = F(X,T),
C  USING THE RUNGE-KUTTA FOURTH ORDER METHOD.
C
C  PARAMETERS ARE :
C
C  FCN   - FUNCTION SUBPROGRAM TO COMPUTE DX/DT = F(X,T). THIS
C          MUST BE DECLARED EXTERNAL IN THE CALLING PROGRAM.
C  X0,T0 - X AND T VALUES AT THE BEGINNING OF THE INTERVAL.
C  H     - STEP SIZE, THE VALUE OF DELTA T.
C  X     - X VALUE AT THE END OF THE INTERVAL AS RETURNED.
C
      REAL FCN,T0,H,X0,X,XK1,XK2,XK3,XK4
C
      XK1 = H * FCN(X0,T0)
      XK2 = H * FCN(X0+XK1/2.0,T0+H/2.0)
      XK3 = H * FCN(X0+XK2/2.0,T0+H/2.0)
      XK4 = H * FCN(X0+XK3,T0+H)
      X = X0 + (XK1 + 2.0*XK2 + 2.0*XK3 + XK4) / 6.0
      RETURN
      END


      SUBROUTINE INTERP (X1,XN,H,N,Y,X,M,YOUT)
C
C  SUBROUTINE INTERP :
C     SUBROUTINE TO INTERPOLATE IN A TABLE OF
C  UNIFORMLY SPACED VALUES.
C
C  PARAMETERS ARE :
C
C  X1,XN - BEGINNING AND ENDING X VALUES
C  H     - DELTA X - THE UNIFORM SPACING
C  N     - NUMBER OF ENTRIES IN THE TABLE
C  Y     - ARRAY OF FUNCTION VALUES
C  X     - VALUE AT WHICH Y IS TO BE INTERPOLATED
C  M     - DEGREE OF INTERPOLATING POLYNOMIAL. THE SUBROUTINE WILL
C          HANDLE UP TO 10TH DEGREE, BUT USUALLY THE DEGREE WILL BE
C          LESS THAN 10 TO AVOID ROUND OFF ERRORS.
C  YOUT  - THE INTERPOLATED Y VALUE RETURNED TO THE CALLER
C
C  AN ARRAY D IS USED IN THE SUBROUTINE TO HOLD DELTA Y'S.
C
      REAL X1,XN,H,Y(N),X,YOUT
      INTEGER N,M,I,J,K
      REAL D(10),FM,FJ,X0,Y0,S,FNUM,DEN,FI
C
C  FIRST FIND PROPER SUBSCRIPT FOR Y0 SO THAT X IS CENTERED IN THE
C  DOMAIN AS WELL AS POSSIBLE. THIS SUBSCRIPT VALUE IS CALLED J.
C
      FM = M + 1
      J = ( X - X1 )/H - FM/2.0 + 2.0
      IF ( X .LE. X1 + FM/2.0*H ) J = 1
      IF ( X .GE. XN - FM/2.0*H ) J = N - M
      FJ = J
      X0 = X1 + ( FJ - 1.0 )*H
      Y0 = Y(J)
C
C  COMPUTE THE DIFFERENCES THAT ARE NEEDED
C
      DO 10 I = 1,M
        D(I) = Y(J+1) - Y(J)
        J = J + 1
   10 CONTINUE
      IF ( M .GT. 1 ) THEN
        DO 20 J = 2,M
          DO 15 I = J,M
            K = M - I + J
            D(K) = D(K) - D(K-1)
   15     CONTINUE
   20   CONTINUE
      END IF
C
C  COMPUTE S VALUE
C
      S = ( X - X0 )/H
C
C  COMPUTE INTERPOLATED Y VALUE
C
      YOUT = Y0
      FNUM = S
      DEN = 1.0
      DO 30 I = 1,M
        FI = I
        YOUT = YOUT + FNUM/DEN*D(I)
        FNUM = FNUM * ( S - FI )
        DEN = DEN * ( FI + 1.0 )
   30 CONTINUE
      RETURN
      END


OUTPUT FOR PROGRAM 2 


SOLUTION TO A DIFFERENTIAL EQUATION 


BY RK4TH METHOD 
      PROGRAM PRKSYST (INPUT,OUTPUT)
C
C  THIS PROGRAM USES SUBROUTINE RKFSYS TO SOLVE THE SYSTEM
C  OF EQUATIONS:
C       X' = F(T,X,Y) = Y
C       Y' = G(T,X,Y) = X + T
C  WHERE X,Y ARE THE DEPENDENT VARIABLES AND T IS THE
C  INDEPENDENT VARIABLE.
C
C  WE COMPARE THE COMPUTED SOLUTIONS WITH THE EXACT ONES:
C       X(T) = COSH(T) - T
C       Y(T) = SINH(T) - 1
C
      REAL T0,H,XEND(10),F(10),X0(10),
     +     TFINAL,TPRINT,TOL
      INTEGER N,I,J
C
      COMMON TOL,TFINAL
      EXTERNAL DERIVS
C
      DATA N,X0/2,1,-1,8*0.0/
      DATA XEND/1,-1,8*0.0/
      DATA T0/0.0/
      DATA H/0.1/
C
      PRINT 200
      TOL = 0.0001
      TFINAL = 1.5
      T0 = 0.0
   10 IF (T0 .LT. TFINAL+0.001) THEN
        PRINT 100, T0,XEND(1),XEND(2),COSH(T0)-T0,SINH(T0)-1.0
        CALL RKFSYS (DERIVS,T0,H,X0,XEND,F,N)
        X0(1) = XEND(1)
        X0(2) = XEND(2)
        GO TO 10
      ENDIF
C
  100 FORMAT(5X,F5.3,2X,5F15.7)
  200 FORMAT(///T5,'TIME',T20,'X(TIME)',T35,'Y(TIME)'
     +       ,T50,'X-EXACT',T65,'Y-EXACT'/)
      STOP
      END

Figure 5.5 Program 3.


      SUBROUTINE RKFSYS (DERIVS,T0,H,X0,XEND,F,N)
C
C  SUBROUTINE RKFSYS :
C     THIS SUBROUTINE SOLVES A SYSTEM OF N FIRST
C  ORDER DIFFERENTIAL EQUATIONS BY THE RUNGE-KUTTA-FEHLBERG
C  METHOD. THE EQUATIONS ARE OF THE FORM:
C
C     DX1/DT = F1(X,T), DX2/DT = F2(X,T), ETC.
C
C  PARAMETERS ARE :
C
C  DERIVS - A SUBROUTINE THAT COMPUTES VALUES OF THE N DERIVATIVES.
C           IT MUST BE DECLARED EXTERNAL BY THE CALLER. IT IS
C           INVOKED BY THE STATEMENT :
C              CALL DERIVS (X,T,F,N)
C  T0     - THE INITIAL VALUE OF INDEPENDENT VARIABLE
C  H      - THE INCREMENT TO T, THE STEP SIZE
C  X0     - THE ARRAY THAT HOLDS THE INITIAL VALUES OF THE FUNCTIONS
C  XEND   - AN ARRAY THAT RETURNS THE FINAL VALUES OF THE FUNCTIONS
C  XWRK   - AN ARRAY USED TO HOLD INTERMEDIATE VALUES DURING THE
C           COMPUTATION. IT MUST BE DIMENSIONED OF SIZE 4 X N IN
C           THE MAIN PROGRAM.
C  N      - THE NUMBER OF EQUATIONS IN THE SYSTEM BEING SOLVED
C  F      - AN ARRAY THAT HOLDS VALUES OF THE DERIVATIVES
C
      REAL X0(N),XEND(N),XWRK(6,10),F(N),H,T0,TOL
      INTEGER I,N
C
C  LOCAL VARIABLES
C
      REAL ERROR,SUM,ABS,MAX,STOREH,TEND
C
      COMMON TOL,TFINAL
C
C  INITIALIZE FOR INTERVAL [T0, T0+H]
C
      TEND = T0 + H
      STOREH = H
C
C  CHECK TO SEE IF WE ARE FINISHED
C
    5 IF (T0 .GE. TEND) THEN
        H = STOREH
        RETURN
      ELSE
C
C  GET FIRST ESTIMATE OF THE DELTA X'S
C
        CALL DERIVS (X0,T0,F,N)
        DO 10 I = 1,N
          XWRK(1,I) = H * F(I)
          XEND(I) = X0(I) + XWRK(1,I)/4.0
   10   CONTINUE
C
C  GET THE SECOND ESTIMATE. THE XEND VECTOR HOLDS THE X VALUES.
C
        CALL DERIVS (XEND,T0+H/4.0,F,N)
        DO 20 I = 1,N
          XWRK(2,I) = H * F(I)
          XEND(I) = X0(I) + (XWRK(1,I)*3.0 + XWRK(2,I)*9.0)/32.0
   20   CONTINUE
C
C  REPEAT FOR THIRD ESTIMATE
C
        CALL DERIVS (XEND,T0+3.0*H/8.0,F,N)
        DO 30 I = 1,N
          XWRK(3,I) = H * F(I)
          XEND(I) = X0(I) + (XWRK(1,I)*1932.0 - XWRK(2,I)*7200.0
     +              + XWRK(3,I)*7296.0)/2197.0
   30   CONTINUE
C
C  NOW GET FOURTH ESTIMATE
C
        CALL DERIVS (XEND,T0+12.0*H/13.0,F,N)
        DO 40 I = 1,N
          XWRK(4,I) = H * F(I)
          XEND(I) = X0(I) + (439.0*XWRK(1,I)/216.0 - 8.0*XWRK(2,I)
     +              + 3680.0*XWRK(3,I)/513.0
     +              - 845.0*XWRK(4,I)/4104.0 )
   40   CONTINUE
C
C  NOW GET FIFTH ESTIMATE
C
        CALL DERIVS (XEND,T0+H,F,N)
        DO 50 I = 1,N
          XWRK(5,I) = H * F(I)
          XEND(I) = X0(I) - 8.0*XWRK(1,I)/27.0 + 2.0*XWRK(2,I)
     +              - 3544.0*XWRK(3,I)/2565.0 + 1859.0*XWRK(4,I)/4104.0
     +              - 11.0*XWRK(5,I)/40.0
   50   CONTINUE
C
C  NOW GET SIXTH ESTIMATE
C
        CALL DERIVS (XEND,T0+H/2.0,F,N)
        DO 60 I = 1,N
          XWRK(6,I) = H * F(I)
   60   CONTINUE
C
C  WE ESTIMATE THE ERROR BY COMPUTING THE DIFFERENCE BETWEEN
C  THE FOURTH AND FIFTH ORDER EQUATIONS.
C
        SUM = 0.0
        ERROR = 0.0
        DO 70 I = 1,N
          SUM = ABS (XWRK(1,I)/360.0 - 128.0*XWRK(3,I)/4275.0
     +          - 2197.0*XWRK(4,I)/75240.0
     +          + XWRK(5,I)/50.0 + 2.0*XWRK(6,I)/55.0)
          ERROR = AMAX1 (ERROR,SUM)
   70   CONTINUE
C
        IF (ERROR .LT. TOL) THEN
C
C  WE COMPUTE THE X AT THE END OF THE INTERVAL FROM A WEIGHTED AVERAGE
C  OF THE SIX ESTIMATES, THEN RETURN.
C
          DO 80 I = 1,N
            XEND(I) = X0(I) + 16.0*XWRK(1,I)/135.0
     +                + 6656.0*XWRK(3,I)/12825.0
     +                + 28561.0*XWRK(4,I)/56430.0
     +                - 9.0*XWRK(5,I)/50.0
     +                + 2.0*XWRK(6,I)/55.0
            X0(I) = XEND(I)
   80     CONTINUE
          T0 = T0 + H
        END IF
        IF (ERROR .GT. TOL) THEN
          H = H/2.0
        ENDIF
        IF (ERROR .LT. H*TOL/10.0) THEN
          H = H*2.0
        ENDIF
        IF (T0 + H .GT. TEND) THEN
          H = TEND - T0
        ENDIF
        GO TO 5
      ENDIF
      END


C
C  DEFINE THE FUNCTIONS OF THE PROBLEM IN TERMS OF THE
C  F'S AND THE X'S.
C
      SUBROUTINE DERIVS (X,T,F,N)
      REAL X(N),T,F(N)
      INTEGER N
C
C     F(1) = X(2)
C     F(2) = X(1) + T
C
      F(1) = X(2)
      F(2) = X(1) + T
      RETURN
      END

OUTPUT FOR PROGRAM 3

   TIME       X(TIME)        Y(TIME)        X-EXACT        Y-EXACT

   .000      1.0000000     -1.0000000      1.0000000
   .100       .9050042      -.8998333       .9050042
   .200       .8200668      -.7986640       .8200668
   .300       .7453385      -.6954797       .7453385
   .400       .6810724      -.5892477       .6810724
   .500       .6276260      -.4789047       .6276260
   .600       .5854652      -.3633464       .5854652
   .700       .5551690      -.2414163       .5551690
   .800       .5374349      -.1118940       .5374349
   .900       .5330864       .0265167       .5330864
  1.000       .5430806       .1752012       .5430806
  1.100       .5685185       .3356475       .5685186
  1.200       .6106555       .5094613       .6106556
  1.300       .6709142       .6983824       .6709142       .6983824
  1.400       .7508984       .9043015       .7508985       .9043015
  1.500       .8524096      1.1292794       .8524096      1.1292795
EXERCISES


Section 5.2
» 1. a) Solve the differential equation


$$\frac{dy}{dx} = x + y + xy, \qquad y(0) = 1,$$


by Taylor-series expansion to get the value of y at x = 0.1 and at x = 0.5. Use terms 
through x. 
b) Do the same for 


$$\frac{dy}{dx} = -y - 2x, \qquad y(0) = -1.$$

(Analytical solution is $y(x) = -3e^{-x} - 2x + 2$.)
c) Do the same for 
dy 5x 
no yD (0) = 2. 


(Analytical solution is y(x) = V 


2. The general solution to a differential equation normally defines a family of curves. For the
differential equation 


find, using the Taylor-series method, the particular curve that passes through (1, 0). Also 
find the curve through (0, 1). Compare to the analytical solutions. 


» 3. Use the Taylor-series method to get y at x = 0.2(0.2)0.6,* given that
y" =x, y(0) = 1, y'(0) = 1. 


4. A spring system has resistance to motion proportional to the square of the velocity, and its
motion is described by

$$\frac{d^2x}{dt^2} + 0.1\left(\frac{dx}{dt}\right)^2 + 0.6x = 0.$$


If the spring is released from a point that is a unit distance above its equilibrium point, 
x(0) = 1, x'(0) = 0, use the Taylor-series method to write a series expression for the displace- 
ment as a function of time, including terms up to 1°. 


Section 5.3
» 5. Use the simple Euler method to solve for y(0.1) from
$$\frac{dy}{dx} = x + y + xy, \qquad y(0) = 1,$$


with h = 0.01. Comparing your result to the value determined by Taylor series in Exercise 
1, estimate how small h would need to be to obtain four-decimal accuracy.


*This notation means for x = 0.2 through x = 0.6 with increments of 0.2. 


6. Solve the differential equation

$$\frac{dy}{dx} = \frac{x}{y}, \qquad y(0) = 1,$$

by the simple Euler method with h = 0.1, to get y(1). Then repeat with h = 0.2 to get
another estimate of y(1). Extrapolate these results, assuming errors are proportional to step
size, and then compare them to the analytical result. (Analytical result is $y^2 = 1 + x^2$.)

7. Repeat Exercise 5, but with the modified Euler method with h = 0.025, so that the solution
is obtained after four steps. Comparing with the result of Exercise 5, about how much less
effort is it to solve this problem to four decimals with the modified Euler method in comparison
to the simple Euler method?


8. Find the solution to
dy 
Vey 
de - 


24 2, y(l) = 0, attr = 2, 

by the modified Euler method, using h = 0.1. Repeat with h = 0.05. From the two results,
estimate the accuracy of the second computation.

9. Solve y' = sin x + y, y(0) = 2, by the modified Euler method to get y at x = 0.1(0.1)0.5.

10. A sky diver jumps from a plane, and during the time before the parachute opens, the air
resistance is proportional to the 3 power of the diver's velocity. If it is known that the maximum
rate of fall under these conditions is 80 mph, determine the diver's velocity during the first
2 sec of fall using the modified Euler method with Δt = 0.2. Neglect horizontal drift and
assume an initial velocity of zero.


Section 5.4 


11. Solve Exercise 5 by the Runge-Kutta method but with h = 0.1 so that the solution is obtained
in only one step. Carry five decimals, and compare the accuracy and amount of work required
with this method against the simple and modified Euler techniques in Exercises 5 and 7.

12. Solve Exercise 8 by the Runge-Kutta method, using h = 0.2, 0.1, and 0.05.

13. Determine y at x = 0.2(0.2)0.6 by the Runge-Kutta technique, given that


di 
de x+y’ 


(0) = 2 


14. Using the conditions of Exercise 10, determine how long it takes for the jumper to reach
90% of his or her maximum velocity, by integrating the equation using the Runge-Kutta
technique with Δt = 0.5 until the velocity exceeds this value, and then interpolating. Then
use numerical integration on the velocity values to determine the distance the diver falls in
attaining $0.9v_{\max}$.

15. It is not easy to know the accuracy with which the function has been determined by either
the Euler methods or the Runge-Kutta method. A possible way to measure accuracy is to
repeat the problem with a smaller step size, and compare results. If the two computations
agree to n decimal places, one then assumes the values are correct to that many places. Repeat
Exercise 14 with Δt = 0.3, which should give a global error about one-eighth as large, and
by comparing results, determine the accuracy in Exercise 14. (Why do we expect to reduce
the error eightfold by this change in Δt?)

16. Write a FORTRAN program to implement the Runge-Kutta-Fehlberg algorithm for solving
an initial-value differential equation of the form $y' = f(x, y)$, $y(x_0) = A$.
17. Solve Exercises 1, 6, and 9 using the Runge-Kutta-Fehlberg algorithm.

»18. Solve y' = 2x - y, y(0) = -1, by the Runge-Kutta-Fehlberg algorithm for x = 0.2(0.2)1.0.
(The exact answer is $y(x) = e^{-x} + 2x - 2$.)


19. a) Exercise 15 suggests a way to estimate the accuracy of a classical Runge-Kutta com- 
putation. Runge—Kutta—Fehlberg has an advantage over the classical fourth-order pro- 
cedure because it gives an error estimate without additional computations. Use Runge— 
Kutta—Fehlberg in Exercise 14 with a step size of t = 0.5. Compare the estimate of 
accuracy with what you obtained in Exercise 15. 

b) Repeat part (a) but now use Runge—Kutta—Merson. 

Section 5.5 


»20. The equation

$$\frac{dy}{dx} = f(x, y) = 2x(y - 1), \qquad y(0) = 0,$$

has initial values as follows:

   x        y           f
   0        0           0
   0.1     -0.01005    -0.20201
   0.2     -0.04081    -0.41632
   0.3     -0.09417    -0.65650
   0.4     -0.17351    -0.93881

a) Using the Adams procedure described in Section 5.5, compute y(0.5) by fitting a quadratic
through the last three values of f(x, y). Compare to the exact answer: -0.28403.

b) Repeat, using the last four points to fit a cubic.

c) Repeat again, but use five points to fit a quartic.


21. For the differential equation

$$\frac{dy}{dt} = y - t^2, \qquad y(0) = 1,$$

starting values are known:
y(0.2) = 1.2186, y(0.4) = 1.4682, y(0.6) = 1.7379.
Use the Adams method, fitting cubics with the last four (y, t) values and advance the solution
to t = 1.2. Compare to the analytical solution.

22. For the equation
dy 


es ears 1) =0, 
= t t, yal) 


the analytical solution is easy to find: 


If we use three points in the Adams method, what error would we expect in the numerical 
solution? Confirm your expectation by performing the computations. 
23. Continue the results of Exercise 13 using Adams' method, employing cubics to extrapolate
the y'-values.

Section 5.6
24. For the differential equation

$$\frac{dy}{dx} = y - x^2, \qquad y(0) = 1,$$

starting values are known:
y(0.2) = 1.2186, y(0.4) = 1.4682, y(0.6) = 1.7379.

Use the Milne method to advance the solution to x = 1.2. Carry four decimals and compare
to the analytical solution.


»25. For the differential equation dy/dx = x/y, the following values are given:

   x    y
   0    √1
   1    √2
   2    √5
   3    √10

To how many decimal places will Milne’s method give the value at x = 4? How many decimal 
places must be carried in the starting values of y to ensure this accuracy? 


26. For the equation y' = y sin wx, y(0) = 1, get starting values by the Runge—Kutta—Fehlberg 
method for x = 0.2(0.2)0.6, and advance the solution to x = 1.0 by Milne’s method. 
»27. Continue the results of Exercise 13 to x = 2.0 by the method of Milne. If you find that the 
corrector formula reproduces the predictor values, double the value of h after sufficient values 
are available. 


28. Check that $y_n$ as defined by Eq. (5.15) is a solution of the difference equation, Eq. (5.14).
29. Perform the long division to show that

$$Z_1 = 1 + Ah + O(h^2), \qquad Z_2 = -1 + \frac{Ah}{3} + O(h^2),$$

as given in Eq. (5.18).
Section 5.7 


30. Express the differences in Eqs. (5.20) and (5.21) in terms of functional values to show that
Eqs. (5.22) are equivalent.

31. Solve Exercise 21 using the Adams-Moulton method.
32. Repeat Exercise 26 using the Adams-Moulton method.
»33. For the equation
using h = 0.2, compute three new values by the Runge-Kutta method (four decimals). Then
advance to x = 1.4 using the Adams-Moulton method. If you find the accuracy criterion is
not met, use Eqs. (5.23) to interpolate additional values so that four-place accuracy is
maintained.


34. Derive the interpolation formulas given in Eq. (5.23). 
Section 5.8 
»35. Given the linear differential equation dy/dx = y sin x:
a) What is the maximum value of h that ensures convergence of the Adams-Moulton method
when continuing applications of the corrector formula are made?
b) If one-tenth of this maximum value is used, how close must the predictor and corrector
values be so that recorrections are not required?
c) In terms of the maximum h in part (a), what size of h is implied in the accuracy criterion,
$D \cdot 10^N < 14.2$?
36. Repeat Exercise 35 for the differential equation 
e =xt+y%, (0) =0, 
in the neighborhood of the point (1.0, 0.15). 

37. Repeat Exercise 35, using the Milne method. For part (c), the accuracy criterion is
$D \cdot 10^N < 29$.

38. Derive Eq. (5.28). 

39. Derive convergence criteria similar to Eqs. (5.26) and (5.27) for the Euler predictor—corrector 
method. Why can one not derive an accuracy criterion similar to those for the methods of 
Milne and Adams—Moulton? 

Section 5.9 

40. Estimate the propagated error at each step when the equation

$$\frac{dy}{dx} = x + y, \qquad y(0) = 1,$$

is solved by the simple Euler method with h = 0.02, for x = 0(0.02)0.1. (The equation is
solved by this method in Section 5.3.) Compare to the actual errors.
»41. Follow the propagated error between x = 1 and x = 1.6 when the simple Euler method is
used to solve

$$\frac{dy}{dx} = xy^2, \qquad y(1) = 1.$$

Take h = 0.1. Compare to the actual errors at each step. The analytical solution is
$y = 2/(3 - x^2)$.
42. We can derive the global error (Eq. 5.31b) for Euler’s method without making use of the 


second-order difference equation. With the same assumptions about M and K and using the 
fact that the series 


eax pata eie hr 


Figure 5.6 
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show that 


€, = 


7 (e(%e-“0)K — 1), 


(Hint: Let s = 1 + AK.) 


Section 5.10 
43. The mathematical model of an electrical circuit is given by the equation

$$0.5\frac{d^2Q}{dt^2} + 6\frac{dQ}{dt} + 50Q = 24\sin 10t,$$

with Q = 0 and I = dQ/dt = 0 at t = 0. Express as a pair of first-order equations.


44. In the theory of beams it is shown that the radius of curvature at any point is proportional
to the bending moment:

$$\frac{y''}{[1 + (y')^2]^{3/2}} = M(x),$$

where y is the deflection of the neutral axis. In the usual approach, $(y')^2$ is neglected in
comparison to unity, but if the beam has appreciable curvature, this is invalid. For the cantilever
beam for which y(0) = y'(0) = 0, express the equation as a pair of simultaneous first-order
equations.
45. The motion of the compound spring system as sketched in Fig. 5.6 is given by the solution
of the pair of simultaneous equations

$$m_1\frac{d^2y_1}{dt^2} = -k_1y_1 - k_2(y_1 - y_2), \qquad m_2\frac{d^2y_2}{dt^2} = k_2(y_1 - y_2),$$

where $y_1$ and $y_2$ are the displacements of the two masses from their equilibrium positions.
The initial conditions are

$$y_1(0) = A, \quad y_1'(0) = B, \quad y_2(0) = C, \quad y_2'(0) = D.$$

Express as a set of first-order equations.


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> 
>46. Solve the pair of simultaneous equations ufe Ms 


dx/dt=xy +t, x0)=0, aefdr=x-t, yO) =1, 


by the modified Euler method for r = 0.2(0.2)0.6. (Carry three decimals rounded.) Recorrect 
until reproduced to three decimals. 


47. Advance the solution of Exercise 46 to x = 1.0 (h = 0.2) by the Adams—Moulton method. 
48. Repeat Exercise 47 but use Milne’s method. 
»49. Find y at x = 0.6, given that
Y= HOH, SOS =i 
Begin the solution by the Taylor-series method, getting 
y(0.1), y(0.2), y(0.3).


Then advance to x = 0.6 employing the Adams—Moulton technique with h = 0.1 on the 
equivalent set of first-order equations. 
50. Express the third-order equation 


yy = =sH=t (0) = y"(0) = 0, y'(0) = 1, 


as a set of first-order equations and solve at t = 0.2, 0.4, 0.6 by the Runge-Kutta method


(h = 0.2). 
51. Using the Adams—Moulton method with h = 0.5, advance the solution of Exercise 50 to 
t = 1.0. Estimate the accuracy of the value of y att = 1.0. 
Section 5.11 
»52. For a resonant spring system with a periodic forcing function, the differential equation is

$$\frac{d^2x}{dt^2} + 64x = 16\cos 8t, \qquad x(0) = x'(0) = 0.$$

Determine the displacement at t = 0.1(0.1)0.8 by the method of Eq. (5.35), getting
the starting values by any other method of this chapter. Compare to the analytical solution
$t \sin 8t$.
53. Apply Gear's formula for solving a differential equation, as given in Section 5.11, to the 
problem of Exercise 52. 
54. For the first-order equation 
ay 


bef 2 
at 


=sP=12+62=1, 390) =a, 


it is not difficult to verify that the solution is 
3 
y = ae + - hk 
If a = 0 (so y(0) = 0), the solution reduces to a simple cubic in t. Still, accumulated
round-off errors act as if the initial value of y is not exactly zero, causing the exponential term to
appear when it should not. This makes the problem a stiff equation. Assume that y(0) = 0
and demonstrate that you do not get the analytical answer when you integrate from t = 0 to
t = 2 using the Runge-Kutta fourth-order method with a step size of 0.005.


Figure 5.7 


55. In finding the deflection of a beam, the y' term is often neglected (see Exercise 44). (The
analytical method is easy if y' is neglected in the equation.) What is the difference in the
calculated values of the maximum deflection when the nonlinear relationship (Exercise 44)
is used in comparison to the simpler linear equation? For light loads, the error is negligible,
of course; only for heavier ones is there a need to use the nonlinear equation. At what value
of the load is the error equal to 1% of the true value?


56. Write a program that solves a first-order initial-value problem by the Adams-Moulton method,
calling the RK4TH subroutine of the text to obtain starting values. As each new point is 
computed, monitor the accuracy by comparing the error estimate to a tolerance parameter, 
printing out a message if the tolerance is not met, but continuing the solution anyway. This 
might best be done by incorporating the Adams—Moulton method in a subroutine that advances 
the solution one step. 


57. The equation y' = 1 + y^2, y(0) = 0, has the analytical solution y = tan x. The tangent function
is infinite at x = π/2. Use the program you wrote in Problem 56 to solve this equation
between x = 0 and x = 1.6 with a step size of 0.1. Compare the results of the program with
the analytical solution.

58. What is the behavior of the Runge-Kutta fourth-order method when used over x = 0 to x =
1.6 on the equation of Problem 57? How does it compare with the multistep methods? How
does its behavior change with step size?


59. Enlarge on Problem 56 by modifying your program so that the step size is halved if the error
estimate exceeds the tolerance value. Use Eq. (5.23) to interpolate for the new values needed 
when this is done. If the error estimate is very small, say 35 times the tolerance, your program 
should double the step size. (Some programs that provide for doubling the step size keep 
track of how many uniformly spaced values are available and defer the doubling until there 
are seven. You may wish to do this, but it is a tricky bit of programming. Alternatively, one 
could always compute two additional values after discovering that the step size can be 
increased; this guarantees that there are at least seven equispaced values. Another method 
would be to extrapolate to obtain a value at 1; when we find, at 1,,,,, that the error is 3p 
of the tolerance.) 


60. In an electrical circuit (Fig. 5.7) containing resistance, inductance, and capacitance (and every
circuit does), the voltage drop across the resistance is iR (i is current in amperes, R is resistance
in ohms); across the inductance it is L(di/dt) (L is inductance in henries), and across the
capacitance it is q/C (q is charge in the capacitor in coulombs, C is capacitance in farads).
We then can write, for the voltage difference between points A and B,

$$V_{AB} = L\frac{di}{dt} + Ri + \frac{q}{C}.$$

[Circuit diagram for Fig. 5.7: R, L, and C connected in series between terminals A and B.]
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Figure 5.8 




Differentiating with respect to t and remembering that dq/dt = i, we have a second-order
differential equation:

$$L\frac{d^2 i}{dt^2} + R\frac{di}{dt} + \frac{1}{C}\,i = \frac{dV}{dt}.$$

If the voltage V_AB (which has previously been 0 V) is suddenly brought to 15 V (let us say,
by connecting a battery across the terminals) and maintained steadily at 15 V (so dV/dt =
0), current will flow through the circuit. Use an appropriate numerical method to determine
how the current varies with time between 0 and 0.1 sec if C = 1000 μF, L = 50 mH, and
R = 4.7 ohms; use Δt of 0.002 sec. Also determine how the voltage builds up across
the capacitor during this time. You may want to compare the computations with the analytical
solution.


61. Repeat Problem 60 but let the voltage source be a 60-Hz sinusoidal input:

$$V_{AB} = 15\sin(120\pi t).$$


How closely does the voltage across the capacitor resemble a sine wave during the last full 
cycle of voltage variation? 


62. After the voltages have stabilized in Problem 60 (15 V across the capacitor), the battery is
shorted so that the capacitor discharges through the resistance and inductor. Follow the current
and the capacitor voltages for 0.1 sec, again with Δt = 0.002 sec. The oscillations of decreasing
amplitude are called damped oscillations. If the calculations are repeated but with the resistance
value increased, the oscillations will be damped out more quickly; at R = 14.14 ohms the
oscillations should disappear; this is called critical damping. Perform numerical computations
with values of R increasing from 4.7 to 22 ohms to confirm that critical damping occurs at
14.14 ohms.

63. In Chapter 2, electrical networks were discussed, but these were simple ones with resistance
only. A more realistic situation is to have RLC circuits combined into a network. Using either 
the current or voltage laws of Kirchhoff, we can set up a set of simultaneous equations, but 
now these are simultaneous differential equations. For example, for the circuit in Fig. 5.8, 
which has two voltage sources, we have 


ai; diy, ty @i, di, iy 


yt Loa rR ER ee. = Lage Rage ma, es 
@i, di Pap tel, ply he 
(L, + Lae + (Ry + Rn = GtG Lae Ron C; et). 


In the equations, i_1 and i_2 represent the currents in each of the loops. Solve the equations for
i_1 and i_2 between t = 0 and t = 0.2 sec, if

$$e_1(t) = 100\sin(120\pi t), \qquad e_2(t) = 0,$$


eas) i ra i ©) a(t) 


given that

R_1 = 22 ohms,    C_1 = C_2 = C_3 = 10 pf,
R_2 = 4.7 ohms,   L_1 = 2.5 mH,
R_3 = 47 ohms,    L_2 = L_3 = 0.5 mH.


64. A Foucault pendulum is one free to swing in both the x- and y-directions. It is frequently
displayed in science museums to exhibit the rotation of the earth, which causes the pendulum
to swing in directions that continuously vary. The equations of motion are

$$\ddot{x} - 2\omega\sin\psi\,\dot{y} + k^2 x = 0,$$
$$\ddot{y} + 2\omega\sin\psi\,\dot{x} + k^2 y = 0,$$

when damping is absent (or compensated for). In these equations the dots over the variable
represent differentiation with respect to time. Here ω is the angular velocity of the earth's
rotation (7.29 × 10^-5 sec^-1), ψ is the latitude, and k^2 = g/ℓ, where ℓ is the length of the
pendulum. How long will it take a 10-m-long pendulum to rotate its plane of swing by 45°
at the latitude where you live? How long if located in Quebec, Canada?


65. Condon and Odishaw (1967) discuss Duffing's equation for the flux φ in a transformer. This
nonlinear differential equation is

$$\ddot{\phi} + \omega_0^2\phi + b\phi^3 = \frac{\omega}{N}E\cos\omega t.$$

In this equation, E sin ωt is the sinusoidal source voltage and N is the number of turns in
the primary winding, while ω_0 and b are parameters of the transformer design. Make a plot
of φ versus t (and compare to the source voltage) if E = 165, ω = 120π, N = 600, ω_0^2 =
83, and b = 0.14. For approximate calculations, the nonlinear term bφ^3 is sometimes
neglected. Evaluate your results to determine whether this makes a significant error in the
results.
66. Ethylene oxide is an important raw material for the manufacture of organic chemicals. It is
produced by reacting ethylene and oxygen together over a silver catalyst. Laboratory studies 
have been reported by Wan (1952). 

It is planned to use this process commercially by passing the gaseous mixture through 
tubes filled with catalyst. Wan’s studies show that the reaction rate varies with pressure, 
temperature, and concentrations of ethylene and oxygen, according to this equation: 


r senior 
r=17% 10e-97161( cg foes 2 le 


where
r = reaction rate (units of ethylene oxide formed per lb of catalyst per hr),
T = temperature, °K (°C + 273),
P = absolute pressure (lb/in^2),
C_E = concentration of ethylene,
C_O = concentration of oxygen.

Under the planned conditions, the reaction will occur, as the gas flows through the tube,
according to the equation

$$\frac{dx}{dL} = 6.42r,$$

where
x = fraction of ethylene converted to ethylene oxide,
L = length of reactor tube (ft).


The reaction is strongly exothermic, so that it is necessary to cool the tubular reactor to 
prevent overheating. (Excessively high temperatures produce undesirable side reactions.) The 
reactor will be cooled by surrounding the catalyst tubes with boiling coolant under pressure 
so that the tube walls are kept at 225°C. This will remove heat proportional to the temperature 
difference between the gas and the boiling water. Of course, heat is generated by the reaction. 
The net effect can be expressed by this equation for the temperature change per foot of tube, 
where B is a design parameter: 


$$\frac{dT}{dL} = 24{,}320r - B(T - 225).$$

For preliminary computations, it has been agreed that we can neglect the change in pressure
as the gases flow through the tubes; we will use the average pressure of P = 22 lb/in^2 absolute.
We will also neglect the difference between the catalyst temperature (which should be used
to find the reaction rate) and the gas temperature. You are to compute the length of tubes
required for 65% conversion of ethylene if the inlet temperature is 250°C. Oxygen is consumed
in proportion to the ethylene converted; material balances show that the concentrations of
ethylene and oxygen vary with x, the fraction of ethylene converted, as follows:

$$C_E = \frac{1 - x}{4 - 0.375x}, \qquad C_O = \frac{1 - 1.125x}{4 - 0.375x}.$$


The design parameter B will be determined by the diameter of tubes that contain the catalyst. 
(The number of tubes in parallel will be chosen to accommodate the quantities of materials 
flowing through the reactor.) The tube size will be chosen to control the maximum temperature 
of the reaction, as set by the minimum allowable value of B. If the tubes are too large in 
diameter (for which the value of B is small), the temperatures will run wild. If the tubes are 
too small (giving a large value to B), so much heat is lost that the reaction tends to be 
quenched. In your studies, vary B to find the least value that will keep the maximum
temperature below 300°C. Permissible values for the parameter B are from 1.0 to 10.0.

In addition to finding how long the tubes must be, we need to know how the temperature 
varies with x and with the distance along the tubes. In order to have some indication of the 
controllability of the process, you are also asked to determine how much the outlet temperature 
will change for a 1°C change in the inlet temperature, using the value of B as determined 
above. 
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6.0 CONTENTS OF THIS CHAPTER 


In all differential equations of order greater than 1, two or more values must be
known to evaluate the constants in the particular function that satisfies the
differential equation. In the problems discussed in Chapter 5, these several values
were all specified at the same value of the independent variable, generally at the
start or initial value. For that reason, such problems are termed initial-value
problems.

For an important class of problems, the several values of the function or its
derivatives are not all known at the same point, but rather at two different values
of the independent variable. Because these values of the independent variable
are usually the endpoints (or boundaries) of some domain of interest, problems
with this type of conditions are classed as boundary-value problems. Determining
the deflection of a simply supported beam is a typical example, where the
conditions specified are the deflections and second derivatives of the elastic curve
at the supports. Heat-flow problems fall in this class when the temperatures or
temperature gradients are given at two points. A special case of the
boundary-value problem occurs in vibration problems.

We will study three different methods for solving boundary-value problems.
In addition, material on a special class of boundary-value problems,
characteristic-value problems, is presented.


Boundary-Value Problems and 
Characteristic-Value Problems 




6.1 THE “SHOOTING” METHOD 
Applies the techniques of Chapter 5 to boundary-value problems by assuming values 
needed to make it into an initial-value problem. This involves a degree of trial and 
error, but, for a linear problem, no more than two trials are required. 

6.2 SOLUTION THROUGH A SET OF EQUATIONS 
Uses finite-difference approximations for the derivatives, allowing one to write a
set of equations whose unknowns are values of the dependent variable at several
points within the domain. The solution to this system gives a solution to the original
boundary-value problem.

6.3 DERIVATIVE BOUNDARY CONDITIONS 
Require a modification of the method of Section 6.2; this is straightforward, but it 
does require an artificial extension of the domain. 

6.4 RAYLEIGH-RITZ METHODS 
Use techniques based on the calculus of variations. An approximating function is 
found by optimizing a functional, an entity related to the original differential 
equation. 

6.5 CHARACTERISTIC-VALUE PROBLEMS 
Are an important special class of boundary-value problems that have a solution only 
for certain characteristic values of a parameter, also called eigenvalues. In this 
section, you learn about both eigenvalues and eigenvectors, essential matrix-related 
quantities that have many important applications. 

6.6 EIGENVALUES OF A MATRIX BY ITERATION 
Describes a method for getting eigenvalues and the associated eigenvectors that is 
well adapted to computers. 

6.7 EIGENVALUES/EIGENVECTORS IN ITERATIVE METHODS 
Illustrates how the use of eigenvalues can provide a theoretical basis for explaining 
when an iterative method will converge and how rapidly. 

6.8 CHAPTER SUMMARY 
Is provided to allow you to test your knowledge of the essential points of the chapter. 

6.9 COMPUTER PROGRAMS 
Are given that implement the shooting method for boundary-value problems and 
the power method for eigenvalues/eigenvectors. 


6.1 THE “SHOOTING” METHOD


Suppose we wish to solve the second-order boundary-value problem 


$$\frac{d^2x}{dt^2} - \left(1 - \frac{t}{5}\right)x = t, \quad x(1) = 2, \quad x(3) = -1. \tag{6.1}$$

Note that if x'(1) were given in addition to x(1), this would be an initial-value
problem. There is a way to adapt our previous methods to this problem, as illustrated in
Fig. 6.1. We know the value of x at t = 1 and at t = 3, as given by the dots. The curve
that represents x between these two points is desired. We anticipate that some such curve,
such as the dotted line, exists;* its slope and curvature are interrelated to x and t by the
differential Eq. (6.1).

Figure 6.1 [Plot of x versus t: dots mark the known values at t = 1 and t = 3, joined by a dotted curve; the slope at t = 1 is unknown.]


If we assume the slope of the curve at t = 1, say x'(1) = -1.5, we could solve the
equation as an initial-value problem using this assumed value. The test of our assumption
is whether we calculate x at t = 3 to match the known value, x(3) = -1. In Table 6.1
we show the results of this computation employing the computer program of Section 5.13
(Fig. 5.3), modified for this equation, using the modified Euler method with Δt = 0.2.
Since the result of this gives x(3) = 4.811, and not the desired x(3) = -1, we assume
another value for x'(1) and repeat. Since the calculated value of x(3) is too high, we
assume a smaller value for the slope, say x'(1) = -3.0. The second computation gives
x(3) = 0.453, which is better but still too high. After these two trials, we linearly
interpolate (here extrapolate) for a third trial. At x'(1) = -3.500, we get the correct
value of x(3). In some problems more attempts are needed to get the correct solution.

The method we have illustrated is called the shooting method because it resembles 
an artillery problem. One sets the elevation of the gun and fires a preliminary round at 
the target. After successive shots have straddled the target, one zeroes in on it by using 
intermediate values of the gun’s elevation. This corresponds to our using assumed values 
of the initial slope and interpolating based on how close we come to x(3). 
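
The two trial shots and the interpolated third shot can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the integrator is a single-pass predictor-corrector (Heun) form of the modified Euler method without recorrection, so its trial values of x(3) need not agree with Table 6.1 to three decimals, and the function names `accel`, `integrate`, and `shoot` are invented for this sketch rather than taken from the programs of the text.

```python
def accel(t, x):
    """Right-hand side of Eq. (6.1), written as x'' = t + (1 - t/5) x."""
    return t + (1.0 - t / 5.0) * x


def integrate(slope, t0=1.0, x0=2.0, t_end=3.0, steps=10):
    """March x' = v, v' = accel(t, x) from t0 to t_end (Heun's method).

    Returns the computed x(t_end) for the assumed slope x'(t0) = slope.
    """
    h = (t_end - t0) / steps
    t, x, v = t0, x0, slope
    for _ in range(steps):
        # Predictor: one simple Euler step.
        a0 = accel(t, x)
        x_pred = x + h * v
        v_pred = v + h * a0
        # Corrector: average the derivatives at both ends of the step.
        a1 = accel(t + h, x_pred)
        x += 0.5 * h * (v + v_pred)   # uses the old v
        v += 0.5 * h * (a0 + a1)
        t += h
    return x


def shoot(g1, g2, target):
    """Two trial slopes followed by one linear interpolation."""
    r1 = integrate(g1)
    r2 = integrate(g2)
    g3 = g1 + (g2 - g1) * (target - r1) / (r2 - r1)
    return g3, integrate(g3)


slope, x_end = shoot(-1.5, -3.0, target=-1.0)
print(f"interpolated x'(1) = {slope:.3f}, resulting x(3) = {x_end:.6f}")
```

Because the equation is linear, the third shot lands on x(3) = -1 to within round-off no matter which two trial slopes are chosen; for a nonlinear equation the interpolation step would have to be repeated.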

You may wonder whether it was only a lucky accident that gave correct results on 
our third trial, using the initial slope x'(1) = —3.500, extrapolated from our earlier 
guesses. This desirable result will always be true when the boundary-value problem is 
linear, as in this case. (A differential equation is linear when the coefficients of each 
derivative term and the function are not functions of x. They may be functions of t, as
in this example, however.) The reason for this is that, if there are two different functions 
of x that satisfy the differential equation (but agreeing with different boundary conditions, 
of course), then a linear combination of them is also a solution. This is quite easy to 
show. 


*The existence of a solution to a boundary-value problem cannot be taken for granted, however. For example,
the problem y'' + y = 0, y(0) = 1, y(π) = 0 has no solution.
Table 6.1  Solving a second-order equation by the shooting method:
x'' = t + (1 - t/5)x,  x(1) = 2,  x(3) = -1

        Assume x'(1) = -1.5    Assume x'(1) = -3.0    Assume x'(1) = -3.500
Time    Distance   Velocity    Distance   Velocity    Distance   Velocity
1.00     2.000     -1.500       2.000     -3.000       2.000     -3.500
1.20     1.751     -0.987       1.449     -2.510       1.348     -3.018
1.40     1.605     -0.478       0.991     -2.068       0.787     -2.599
1.60     1.561      0.043       0.619     -1.655       0.305     -2.221
1.80     1.625      0.594       0.328     -1.252      -0.104     -1.867
2.00     1.803      1.186       0.118     -0.844      -0.443     -1.521
2.20     2.105      1.832      -0.007     -0.417      -0.712     -1.167
2.40     2.542      2.542      -0.045      0.040      -0.908     -0.794
2.60     3.128      3.324       0.013      0.539      -1.026     -0.391
2.80     3.880      4.185       0.175      1.087      -1.060      0.054
3.00     4.811      5.128       0.453      1.693      -1.000      0.547


Suppose that x_1(t) satisfies x'' + Fx' + Gx = H, where F, G, and H are functions
of t only. Suppose also that x_2(t) is a solution. Then

$$y = \frac{c_1 x_1 + c_2 x_2}{c_1 + c_2}$$

will be a solution. We show that this will always be a solution in the following manner:
Since x_1 and x_2 are solutions,

$$x_1'' + F x_1' + G x_1 = H,$$

and

$$x_2'' + F x_2' + G x_2 = H.$$

Substituting y into the differential equation, with

$$y' = \frac{c_1 x_1' + c_2 x_2'}{c_1 + c_2}, \qquad y'' = \frac{c_1 x_1'' + c_2 x_2''}{c_1 + c_2},$$

we get

$$\begin{aligned}
&\frac{c_1 x_1'' + c_2 x_2''}{c_1 + c_2} + F\,\frac{c_1 x_1' + c_2 x_2'}{c_1 + c_2} + G\,\frac{c_1 x_1 + c_2 x_2}{c_1 + c_2} \\
&\quad = \frac{c_1 x_1'' + c_1 F x_1' + c_1 G x_1}{c_1 + c_2} + \frac{c_2 x_2'' + c_2 F x_2' + c_2 G x_2}{c_1 + c_2} \\
&\quad = \frac{c_1 H}{c_1 + c_2} + \frac{c_2 H}{c_1 + c_2} = H.
\end{aligned}$$

With two functions x_1 and x_2 both agreeing with x(1) = 2.0 at the left boundary and
differing at t = 3.0, we can take a linear combination, so the correct value of x(3.0)
results.


Let

G1 = first guess at initial slope;
G2 = second guess at initial slope;
R1 = first result at endpoint (using G1);
R2 = second result at endpoint (using G2).

With D = the desired value at the endpoint, we can write

$$\text{Extrapolated estimate for initial slope} = G1 + \frac{G2 - G1}{R2 - R1}\,(D - R1).$$


In the above example, we get

$$-1.5 + \frac{(-3.0) - (-1.5)}{0.453 - 4.811}\,(-1.0 - 4.811) = -3.500.$$

The consequence of having (c,x, + c2x)/(c; + cz) as a solution of the differential 
equation goes further than just allowing us to interpolate from the two initial guesses of 
x’(1) to predict exactly the correct value required for x’ (1) that is needed in order to give 
agreement with the right-hand boundary condition. This relationship is true throughout 
the interval as well. This means that each of the values of x could be calculated by taking 
the proper combination of the values obtained from the earlier calculations, with this 
combination exactly the same one as that which gives the correct value of x at ¢ = 3. 
(The true curve for x versus t will be intermediate between two curves that bracket the
final condition.)

In our example,

c_1 x_1 + c_2 x_2 = true values.

At left end, t = 1:   c_1(2.0) + c_2(2.0) = 2.0.
At right end, t = 3:  c_1(4.811) + c_2(0.453) = -1.0.
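
The two end conditions fix the weights, and the same weights can then be applied at any interior point. The short Python sketch below does this with distances copied from Table 6.1; the dictionary names and the choice of t = 2.0 as the check point are illustrative assumptions, not part of the text.

```python
# Distances from Table 6.1 at t = 1.0, 2.0, and 3.0 for the two trial slopes.
trial_1 = {1.0: 2.000, 2.0: 1.803, 3.0: 4.811}   # assumed x'(1) = -1.5
trial_2 = {1.0: 2.000, 2.0: 0.118, 3.0: 0.453}   # assumed x'(1) = -3.0

# Left end:  2.0*c1 + 2.0*c2 = 2.0, so c1 + c2 = 1.
# Right end: 4.811*c1 + 0.453*c2 = -1.0; substitute c2 = 1 - c1.
c1 = (-1.0 - trial_2[3.0]) / (trial_1[3.0] - trial_2[3.0])
c2 = 1.0 - c1

# The same weights reproduce the solution throughout the interval.
combined = {t: c1 * trial_1[t] + c2 * trial_2[t] for t in trial_1}

print(f"c1 = {c1:.4f}, c2 = {c2:.4f}")
for t in sorted(combined):
    print(f"t = {t:.1f}: x = {combined[t]:.3f}")
```

At t = 2.0 the combination agrees, to within the rounding of the tabulated values, with the entry in the third pair of columns of Table 6.1.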
6.2 SOLUTION THROUGH A SET OF EQUATIONS


We have seen in Chapter 4 how the derivatives of a function can be approximated by 
finite-difference quotients. If we replace the derivatives in a differential equation by such 
expressions, we convert it to a difference equation whose solution is an approximation 
to the solution of the differential equation. This method is sometimes preferred over the 
shooting method discussed earlier. Consider the same linear example as before: 


$$\frac{d^2x}{dt^2} - \left(1 - \frac{t}{5}\right)x = t, \quad x(1) = 2, \quad x(3) = -1. \tag{6.2}$$

Central-difference approximations to derivatives are more accurate than forward or
backward approximations (O(h^2) versus O(h)), so we replace the derivatives with

$$\frac{dx}{dt} = \frac{x_{i+1} - x_{i-1}}{2h} + O(h^2), \qquad \frac{d^2x}{dt^2} = \frac{x_{i+1} - 2x_i + x_{i-1}}{h^2} + O(h^2).$$

The quantity h is the constant difference in t-values. Substituting these equivalences
into Eq. (6.2), and rearranging, we get

$$\frac{x_{i+1} - 2x_i + x_{i-1}}{h^2} - \left(1 - \frac{t_i}{5}\right)x_i = t_i,$$

$$x_{i-1} - \left[2 + h^2\left(1 - \frac{t_i}{5}\right)\right]x_i + x_{i+1} = h^2 t_i. \tag{6.3}$$

In Eqs. (6.3), we have replaced x with x_i and t with t_i, since these values correspond
to the point at which the difference quotients represent the derivatives.* Our problem
now reduces to solving Eqs. (6.3) at points in the interval from t = 1 to t = 3. Let us
subdivide the interval into a number of equal subintervals. For example, if h = Δt = 1/2,
the points

t_1 = 1,  t_2 = 1.5,  t_3 = 2,  t_4 = 2.5,  and  t_5 = 3

subdivide the interval into four subintervals.
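
The O(h) versus O(h^2) claim for the difference quotients can be checked numerically. The sketch below is an illustration only: the test function sin t, the point t = 1, and the step sizes 0.1 and 0.05 are arbitrary choices made here, not values from the text. Halving h should roughly halve the forward-difference error and roughly quarter the central-difference errors.

```python
import math


def forward_diff(f, t, h):
    """Forward-difference estimate of f'(t); error is O(h)."""
    return (f(t + h) - f(t)) / h


def central_diff(f, t, h):
    """Central-difference estimate of f'(t); error is O(h^2)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)


def second_central_diff(f, t, h):
    """Central-difference estimate of f''(t); error is O(h^2)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)


t0, h = 1.0, 0.1
exact_first = math.cos(t0)     # derivative of sin t
exact_second = -math.sin(t0)   # second derivative of sin t

# Ratio of the error at step h to the error at step h/2.
ratio_forward = (abs(forward_diff(math.sin, t0, h) - exact_first)
                 / abs(forward_diff(math.sin, t0, h / 2) - exact_first))
ratio_central = (abs(central_diff(math.sin, t0, h) - exact_first)
                 / abs(central_diff(math.sin, t0, h / 2) - exact_first))
ratio_second = (abs(second_central_diff(math.sin, t0, h) - exact_second)
                / abs(second_central_diff(math.sin, t0, h / 2) - exact_second))

print(f"forward: {ratio_forward:.2f}  central: {ratio_central:.2f}  "
      f"second derivative: {ratio_second:.2f}")
```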


*Our procedure is to subdivide the interval from the initial to the final value of t into equal subintervals,
replacing the differential equation with a difference equation at each of the discrete points where the function
is unknown.


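We write the difference equation in (6.3) for each of these values of t at which x is
unknown, giving

$$\text{at } t = t_2 = 1.5: \quad x_1 - \left[2 + \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(1 - \tfrac{1.5}{5}\right)\right]x_2 + x_3 = \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)(1.5),$$

$$\text{at } t = t_3 = 2.0: \quad x_2 - \left[2 + \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(1 - \tfrac{2.0}{5}\right)\right]x_3 + x_4 = \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)(2.0),$$

$$\text{at } t = t_4 = 2.5: \quad x_3 - \left[2 + \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(1 - \tfrac{2.5}{5}\right)\right]x_4 + x_5 = \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)(2.5).$$

We know x_1 = 2 and x_5 = -1; hence the difference equation is not written
corresponding to t = 1 or t = 3. Substituting the values x_1 = 2 and x_5 = -1 and simplifying,
we have, in matrix form,

$$\begin{bmatrix} -2.175 & 1 & 0 \\ 1 & -2.150 & 1 \\ 0 & 1 & -2.125 \end{bmatrix}
\begin{bmatrix} x_2 \\ x_3 \\ x_4 \end{bmatrix} =
\begin{bmatrix} -1.625 \\ 0.5 \\ 1.625 \end{bmatrix}.$$

Solving, we get

x_2 = 0.552,  x_3 = -0.424,  and  x_4 = -0.964.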


These are reasonably close to the values obtained by the shooting method. We would 
expect some significant error because our step size h is large and the finite differences 
will be poor approximations to the derivatives. If we take h = 0.2 (10 subdivisions) and 
write the approximating equations, we get 


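$$\begin{bmatrix}
-2.0304 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & -2.0288 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & -2.0272 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -2.0256 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & -2.0240 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & -2.0224 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & -2.0208 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -2.0192 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -2.0176
\end{bmatrix} x =
\begin{bmatrix} -1.952 \\ 0.056 \\ 0.064 \\ 0.072 \\ 0.080 \\ 0.088 \\ 0.096 \\ 0.104 \\ 1.112 \end{bmatrix}. \tag{6.4}$$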


We observe that the system is tridiagonal and therefore speedy to solve and also
economical of memory space to store the coefficients. This will be true even if there are
many equations, because we only use x_{i-1}, x_i, x_{i+1} in any equation to replace x' or x''.
This is one reason why the finite-difference method is widely used to solve second-order
linear boundary-value problems.
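
A minimal Python sketch of this procedure follows: it assembles Eqs. (6.3) for any number of subintervals and solves the tridiagonal system by elimination. The routine below is a generic tridiagonal (Thomas) elimination written for this illustration; it is not the TRIDG subroutine of Chapter 2, and the function names are assumptions.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by elimination and back-substitution.

    lower[i] multiplies x[i-1], diag[i] multiplies x[i], and upper[i]
    multiplies x[i+1]; lower[0] and upper[-1] are never used.
    """
    n = len(diag)
    d, b = list(diag), list(rhs)
    for i in range(1, n):
        m = lower[i] / d[i - 1]
        d[i] -= m * upper[i - 1]
        b[i] -= m * b[i - 1]
    x = [0.0] * n
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - upper[i] * x[i + 1]) / d[i]
    return x


def finite_difference_bvp(n_sub, t_left=1.0, t_right=3.0,
                          x_left=2.0, x_right=-1.0):
    """Set up and solve Eqs. (6.3) with n_sub equal subintervals."""
    h = (t_right - t_left) / n_sub
    ts = [t_left + i * h for i in range(1, n_sub)]       # interior points
    diag = [-(2.0 + h * h * (1.0 - t / 5.0)) for t in ts]
    rhs = [h * h * t for t in ts]
    rhs[0] -= x_left      # known boundary values move to the right side
    rhs[-1] -= x_right
    ones = [1.0] * len(ts)
    return ts, solve_tridiagonal(ones, diag, ones, rhs)


for n_sub in (4, 10):
    ts, xs = finite_difference_bvp(n_sub)
    print(f"h = {2.0 / n_sub}")
    for t, x in zip(ts, xs):
        print(f"  t = {t:.1f}  x = {x:.3f}")
```

Only the three diagonals are stored, so memory and work grow linearly with the number of equations.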
Table 6.4  Results for linear boundary-value problem:
x'' = t + (1 - t/5)x,  x(1) = 2,  x(3) = -1

        Values from finite-    Values from
t       difference method      shooting method
1.0          2.000                 2.000
1.2          1.351                 1.348
1.4          0.792                 0.787
1.6          0.311                 0.305
1.8         -0.097                -0.104
2.0         -0.436                -0.443
2.2         -0.705                -0.712
2.4         -0.903                -0.908
2.6         -1.022                -1.026
2.8         -1.058                -1.060
3.0         -1.000                -1.000


If we use the subroutine TRIDG from Chapter 2 to compute the solution to Eqs. 
(6.4), we get the values shown in Table 6.4, where we also display the results by the 
shooting method from Section 6.1 for comparison. 

The values obtained with Δt = 0.2 are more accurate. We might anticipate that the
error would decrease proportionately to h^2 since the approximations to the derivatives
are of O(h^2) error. If we use the values from the shooting method as the exact values
for comparison, the errors are reduced only about threefold (comparing the values at
t = 2) instead of sixfold. However, the shooting-method calculations are themselves
imperfect, and this adds an element of doubt to our observation. To observe how the
errors decrease when Δt decreases, we look at another example, one where the exact
solution is known.


EXAMPLE  Solve

$$\frac{d^2y}{dx^2} = y, \quad y(1) = 1.1752, \quad y(3) = 10.0179.$$

(The analytical solution is y = sinh x, to which we can compare our estimates of the
function.)

While it is not strictly necessary, it is common to normalize the function to the interval
(0,1).* This we can do by the change of variable

$$x = (b - a)t + a,$$

where (a, b) is the original interval to be normalized. Letting x = 2t + 1, we write

$$\frac{dy}{dx} = \frac{dy}{dt}\,\frac{dt}{dx} = \frac{1}{2}\,\frac{dy}{dt},$$

$$\frac{d^2y}{dx^2} = \frac{d}{dt}\left(\frac{1}{2}\,\frac{dy}{dt}\right)\frac{dt}{dx} = \frac{1}{4}\,\frac{d^2y}{dt^2},$$

and the problem becomes

$$\frac{d^2y}{dt^2} = 4y, \quad y(0) = 1.1752, \quad y(1) = 10.0179.$$

*Perhaps the most important reason is to make a computer program more general. Normalizing is really a
scaling of the interval so we have an immediate understanding of whether a given value of h is “small” or
“large.” Also, when we compute the LU equivalent of the coefficient matrix to solve the system, it is then of
more general applicability.
Replacing the second derivative by the central-difference approximation, and thus
converting the differential equation to a difference equation, we have

$$\frac{y_{i+1} - 2y_i + y_{i-1}}{h^2} = 4y_i, \quad i = 1, 2, 3, \ldots, n.$$

Subdividing the interval (0, 1) into four parts, h = 0.25, and writing the difference
equation at the three internal points where y is unknown, we have the set of equations

$$\begin{bmatrix} -2.25 & 1 & 0 \\ 1 & -2.25 & 1 \\ 0 & 1 & -2.25 \end{bmatrix}
\begin{bmatrix} y_2 \\ y_3 \\ y_4 \end{bmatrix} =
\begin{bmatrix} -1.1752 \\ 0 \\ -10.0179 \end{bmatrix}. \tag{6.5}$$

In Eqs. (6.5) we have used y_1 = 1.1752, y_5 = 10.0179. Again, a tridiagonal matrix
of coefficients occurs. The solution to this set of equations is given in Table 6.5.


Table 6.5

                                  Exact value,
t       x       Calculated y      sinh x        Error
0       1.0      (1.1752)          1.1752         -
0.25    1.5       2.1467           2.1293      -0.0174
0.50    2.0       3.6549           3.6269      -0.0280
0.75    2.5       6.0768           6.0502      -0.0266
1.0     3.0     (10.0179)         10.0179         -


If we recalculate this problem using varying values of h, we get the results shown
in Table 6.6 for y at x = 2.0 (equivalent to t = 0.5).

We observe an approximate fourfold decrease in the error when the step size is
halved, meaning an O(h^2) error does exist. When we know how errors vary with step
size, we can perform an extrapolation similar to that for Romberg integration in Chapter
3, using results for which the step size is halved:

$$\text{Improved value} = \text{More accurate} + \frac{1}{3}\left[(\text{More accurate}) - (\text{Less accurate})\right].$$
Table 6.6

Number of     h = Δt for
equations     normalized interval     Result     Error
    1              0.50               3.7310    -0.1041
    3              0.25               3.6549    -0.0280
    7              0.125              3.6340    -0.0071
    ∞              0                  3.6269    (exact)


In this instance we get 3.6340 + (1/3)(3.6340 - 3.6549) = 3.6270, when results with
Δt = 1/4 and Δt = 1/8 are combined.
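
The step-halving study and the extrapolation can be reproduced with the short Python sketch below. It is an illustration under stated assumptions: the boundary values are the rounded 1.1752 and 10.0179 used in the text, the elimination is a plain tridiagonal solve written for this sketch, and the function name `midpoint_value` is invented here.

```python
import math


def midpoint_value(n_sub, y_left=1.1752, y_right=10.0179):
    """Solve y'' = 4y on (0, 1) by central differences; return y at t = 0.5.

    n_sub must be even so that t = 0.5 is one of the grid points.
    """
    h = 1.0 / n_sub
    n = n_sub - 1                      # number of unknowns
    diag = [-(2.0 + 4.0 * h * h)] * n  # diagonal of y[i-1] - (2 + 4h^2) y[i] + y[i+1] = 0
    rhs = [0.0] * n
    rhs[0] -= y_left                   # known boundary values move to the right side
    rhs[-1] -= y_right
    # Forward elimination; every off-diagonal coefficient equals 1.
    for i in range(1, n):
        m = 1.0 / diag[i - 1]
        diag[i] -= m
        rhs[i] -= m * rhs[i - 1]
    # Back-substitution.
    y = [0.0] * n
    y[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        y[i] = (rhs[i] - y[i + 1]) / diag[i]
    return y[n // 2]


coarsest = midpoint_value(2)   # one equation,    h = 0.50
coarse = midpoint_value(4)     # three equations, h = 0.25
fine = midpoint_value(8)       # seven equations, h = 0.125
improved = fine + (fine - coarse) / 3.0

print(f"h = 0.50:  {coarsest:.4f}")
print(f"h = 0.25:  {coarse:.4f}")
print(f"h = 0.125: {fine:.4f}")
print(f"extrapolated: {improved:.4f}   sinh(2) = {math.sinh(2.0):.4f}")
```

The divisor 3 comes from the O(h^2) error: halving h cuts the error to about one quarter, so the difference between the two results is about three times the remaining error.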


When the differential equation underlying a boundary-value problem is nonlinear,
this method of finite differences runs into a problem in that the resulting system of equations
is nonlinear. Consider the nonlinear problem solved by the shooting method in Section
6.1:

$$\frac{d^2x}{dt^2} - \left(1 - \frac{t}{5}\right)x\,\frac{dx}{dt} = t, \quad x(1) = 2, \quad x(3) = -1.$$

When finite differences are substituted for the derivatives, we obtain an equivalent
set of difference equations as follows:

$$x_{i-1} - \left[2 + \frac{h}{2}\left(1 - \frac{t_i}{5}\right)(x_{i+1} - x_{i-1})\right]x_i + x_{i+1} = h^2 t_i, \tag{6.6}$$

which is to be written at each point t_i at which x is unknown. In the middle term, products
of the x's occur, exhibiting nonlinearity in the difference equations as a consequence of
the nonlinearity in the differential equation. The standard elimination method fails for
these equations.

In such cases, iteration techniques can be employed. Suppose we obtain some
estimates of the x-vector, the solution to the equations. We could use these values as a means
of approximating the coefficient of x_i in Eq. (6.6). (Since the values are multiplied by
h/2, we might hope that errors of the estimate would be diluted in their effect, especially
if h is fairly small, in comparison to the number 2, which should dominate in the coefficient
of x_i.)

When this was done, using h = 0.2 and the initial x_i values estimated from x =
3.5 - 1.5t (a linear relation between x(1) = 2 and x(3) = -1), the values in Table
6.7 resulted.

Further iterations after the eighth did not change the values. In Table 6.7, we have 
listed the results by the shooting method for comparison. In both techniques, an iterative 
process was used; in the finite-difference method, we iterate on the approximations to x, 
while in the shooting method, we use successive estimates to the initial slope. For this 


Table 6.7  Successive approximations to finite-difference equations for
$x'' - (1 - t/5)\,x\,x' = t$, $x(1) = 2$, $x(3) = -1$


Approximations to x-vector 


Results from 


Initial After | After 2 After 4 After 8 shooting 

t estimate iteration iterations iterations iterations method 
1.0     2.000    (2.000)   (2.000)   (2.000)   (2.000)
1.2     1.700     1.601     1.558     1.552     1.557
1.4     1.400     1.110     1.048     1.039     1.053
1.6     1.100     0.583     0.513     0.503     0.527
1.8     0.800     0.077     0.006    -0.005     0.026
2.0     0.500    -0.362    -0.430    -0.440    -0.408
2.2     0.200    -0.704    -0.766    -0.775    -0.747
2.4    -0.100    -0.937    -0.990    -0.998    -0.977
2.6    -0.400    -1.061    -1.101    -1.107    -1.094
2.8    -0.700    -1.080    -1.104    -1.107    -1.101
3.0    -1.000   (-1.000)  (-1.000)  (-1.000)  (-1.000)


example, about the same number of iterations were required, and roughly the same 
accuracy was achieved. (The correct solution is intermediate between the results for the 
two methods shown in Table 6.7.) 

The choice between the finite-difference method and the shooting method for nonlinear 
boundary-value problems is not clearcut. The choice depends on whether a reasonably 
good initial estimate for the x-vector is available and how strongly nonlinear the equation 
is. When the nonlinearity in the set of equations is not diluted in its effect and when no 
good initial estimate is available, iterative methods applied to the algebraic finite-difference 
equations may not even converge. Although the shooting method often involves a some- 
what greater computational effort, it is usually more certain of giving a solution. 


6.3 DERIVATIVE BOUNDARY CONDITIONS


The conditions that the solution of a differential equation must satisfy need not necessarily 
be just the value of the function. In many applied problems, some derivative of the 
function may be known at the boundaries of an interval. In the more general case a linear 
combination of the function and its derivatives is specified. The finite-difference procedure 
needs modification for this type of boundary conditions.* We illustrate by an example: 


*With the shooting method, derivative boundary conditions do not require any change in procedure, but we 
will need to use finite-difference approximations to know when the results match with specified derivatives at 
the far end. 
$$\frac{d^2y}{dx^2} = y, \qquad y'(1) = 1.1752, \quad y'(3) = 10.0179.$$

(This problem has the same differential equation as the example of Section 6.2 but, with
the values specified for the derivative at x = 1 and x = 3, it now has the analytical
solution y = cosh x.)
We begin just as before. We change variables to make the interval (0, 1) by letting
x = 2t + 1:

$$\frac{d^2y}{dt^2} = 4y.$$

We now replace the derivative by a central-difference approximation and write the
difference equation at each point where y is unknown. With h = 0.25,

$$\begin{aligned}
y_\ell - 2.25y_1 + y_2 &= 0,\\
y_1 - 2.25y_2 + y_3 &= 0,\\
y_2 - 2.25y_3 + y_4 &= 0,\\
y_3 - 2.25y_4 + y_5 &= 0,\\
y_4 - 2.25y_5 + y_r &= 0.
\end{aligned} \qquad (6.7)$$


Two more equations are required than in Section 6.2 because y is unknown at t =
0 and t = 1 as well as at the interior points. These two equations involve the values $y_\ell$
and $y_r$ at points one space to the left and to the right of the interval (0, 1). We assume
that the domain of the differential equation can be so extended.

Our problem is now that Eqs. (6.7) contain seven unknowns, and we have only five
equations. The boundary conditions, however, have not yet been involved. Let us also
express these as difference quotients, preferring central-difference approximations of O(h²)
error as used in replacing derivatives in the original equation:

$$\left.\frac{dy}{dx}\right|_{x=1} = \frac{1}{2}\left.\frac{dy}{dt}\right|_{t=0} \approx \frac{1}{2}\,\frac{y_2 - y_\ell}{2h} = 1.1752, \qquad y_\ell = y_2 - (4)(0.25)(1.1752),$$

$$\left.\frac{dy}{dx}\right|_{x=3} = \frac{1}{2}\left.\frac{dy}{dt}\right|_{t=1} \approx \frac{1}{2}\,\frac{y_r - y_4}{2h} = 10.0179, \qquad y_r = y_4 + (4)(0.25)(10.0179). \qquad (6.8)$$


The relations of (6.8), when substituted into (6.7), reduce the number of unknowns 
to five, and we solve the equations in the usual way. The solution is given in Table 6.8. 


Table 6.8

  t        x        y        cosh x      Error

 0        1.0     1.5522     1.5431    -0.0091
 0.25     1.5     2.3338     2.3524     0.0186
 0.50     2.0     3.6989     3.7622     0.0633
 0.75     2.5     5.9887     6.1323     0.1436
 1.00     3.0     9.7757    10.0677     0.2920


We observe here that the errors are much greater than for the previous example, 
being very large at x = 3.0. The explanation is that our approximation for the derivative 
is poor at large x-values. Although the central-difference approximations are O(h?), the 
third derivative appearing in the error term is large in magnitude. In the previous example, 
knowing the function at the endpoints eliminated the errors there. 

Repeating the calculations with the step size cut in half reduces the errors about
fourfold. For example, at t = 0.5, we get 3.7459 (error 0.0163 versus 0.0633); and at
t = 1.0, we get 9.9921 (error 0.0756 versus 0.2920). Halving h again gives another
fourfold reduction in errors (y = 10.0486 at t = 1.0, error of 0.0191). An extrapolation
based on errors of O(h²) will give nearly perfect results.
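The calculation above can be reproduced with a short sketch. It assumes the fictitious points have already been eliminated with the relations of (6.8), so that the first and last rows of the tridiagonal system become $-(2 + 4h^2)y_1 + 2y_2 = 4h\,y'(1)$ and $2y_{n-1} - (2 + 4h^2)y_n = -4h\,y'(3)$. The names `solve_tridiagonal` and `cosh_bvp` are illustrative only.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Elimination for a tridiagonal system (lower[0], upper[-1] unused)."""
    n = len(diag)
    d = list(diag)
    r = list(rhs)
    for i in range(1, n):
        m = lower[i] / d[i - 1]
        d[i] -= m * upper[i - 1]
        r[i] -= m * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x


def cosh_bvp(n):
    """Finite-difference values of y on t = 0, h, ..., 1 for y'' = 4y
    (that is, y'' = y in x = 2t + 1) with derivative end conditions."""
    h = 1.0 / n
    slope_left, slope_right = 1.1752, 10.0179   # dy/dx at x = 1 and x = 3
    size = n + 1
    lower = [1.0] * size
    upper = [1.0] * size
    diag = [-(2.0 + 4.0 * h * h)] * size
    rhs = [0.0] * size
    upper[0] = 2.0                       # y_l replaced by y_2 - 4h y'(1)
    rhs[0] = 4.0 * h * slope_left
    lower[-1] = 2.0                      # y_r replaced by y_(n-1) + 4h y'(3)
    rhs[-1] = -4.0 * h * slope_right
    return solve_tridiagonal(lower, diag, upper, rhs)


coarse = cosh_bvp(4)                     # h = 0.25
fine = cosh_bvp(8)                       # h = 0.125
# O(h^2) extrapolation at t = 1 (x = 3)
extrapolated = fine[-1] + (fine[-1] - coarse[-1]) / 3.0
print(coarse[-1], fine[-1], extrapolated, math.cosh(3.0))
```

The extrapolated value at x = 3 should lie much closer to cosh 3 than either finite-difference result, which is the point made in the paragraph above.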

It is appropriate to summarize the finite-difference method for solving a boundary- 
value problem. 


The Finite-Difference Method for Boundary-Value Problems 


To solve a boundary-value problem, replace the differential equation with a finite- 
difference equation by replacing the derivatives with central-difference quotients. 
Subdivide the interval into a suitable number of equal subintervals, and write 
the difference equation at each point where the value of the function is unknown. 
When the boundary values involve derivatives, this will require that the domain 


be extended beyond the interval. Utilize the derivative boundary conditions to 
write difference quotients that permit the elimination of the fictitious points 
outside the interval. 

Solve the system of equations so created to obtain approximate values for 
the solution of the differential equation at discrete points on the interval. If the 
original differential equation is nonlinear, the system of equations will also be 
nonlinear. In such situations, the shooting method will normally be preferred. 


We could, of course, use more accurate finite-difference approximations to the deriv- 
atives, not only for the boundary values, but for the equation itself. The disadvantage of 
doing this is that the system of equations is then not tridiagonal, and the solution is more 
difficult to arrive at. 
If the more general form of boundary condition applies, ay + by' = c, equations
similar to (6.8) result, and after the boundary conditions have been approximated and
the exterior y-values eliminated from the set of equations, the solution of the problem
proceeds as before.

Equations of order higher than the second will involve approximations for the third
or higher derivatives. Central-difference formulas will involve points more than h away
from the point where the derivative is being approximated, and may be unsymmetrical
in the case of the odd derivatives. Probably the method of undetermined coefficients is
the easiest way to derive the necessary formulas (see Appendix B). Again, nontridiagonal
systems result. Fortunately, most of the important physical problems are simulated by
equations of order 2.

Besides the two methods presented in this chapter there are others that are commonly
referred to as Rayleigh-Ritz, collocation, and finite-element methods. These approximate
the solution to the boundary-value problem, y(x), by writing it as a linear combination
$\sum c_i\phi_i(x)$, i = 1, 2, ..., n, where the $\phi_i$'s are specially chosen functions. These methods
have been found to be very effective for certain kinds of problems. We will introduce
you to one of these methods in the next section.


6.4 RAYLEIGH-RITZ METHODS


In addition to the previous two methods for boundary-value problems, you should know 
something about the Rayleigh—Ritz method. It is based on an elegant branch of mathe- 
matics, the calculus of variations. In this method we solve a boundary-value problem by 
approximating the solution with a finite linear combination of simple basis functions that 
are chosen to fulfill certain criteria, including meeting the boundary conditions. 

The calculus of variations seeks to optimize (often minimize) a special class of
functions called functionals. The usual form for the functional (in problems of one
independent variable) is

$$I[y] = \int_{x_1}^{x_2} F(x, y, y')\,dx. \qquad (6.9)$$

Observe that I[y] is not a function of x because x disappears when the definite integral
is evaluated. The argument y of I[y] is not a simple variable but a function, y = y(x).
The square brackets in I[y] emphasize this fact. A functional can be thought of as a
“function of functions.” The value of the right-hand side of Eq. (6.9) will change as the
function y(x) is varied, but when y(x) is fixed, it evaluates to a scalar quantity (a constant).

Let us illustrate this concept by a very simple example where the solution is obvious
in advance: find the function y(x) that minimizes the distance between two points. While
we know what y(x) must be, let's pretend we don't. The figure suggests that we are to
choose from among the set of curves $y_i(x)$, of which $y_1(x)$, $y_2(x)$, and $y_3(x)$ are representative.

[Figure: curves $y_1(x)$, $y_2(x)$, and $y_3(x)$ joining the points $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$.]

In this simple case, the functional is the integral of the distance along any of these curves:

$$I[y] = \int_{x_1}^{x_2} \sqrt{(dx)^2 + (dy)^2} = \int_{x_1}^{x_2} \sqrt{1 + \left(\frac{dy}{dx}\right)^2}\,dx.$$

To minimize I[y], just as in calculus, we set its derivative to zero. There are certain
restrictions on all the curves $y_i(x)$. Obviously each must pass through the points $(x_1, y_1)$
and $(x_2, y_2)$. In addition, for the optimal trajectory, the Euler-Lagrange equation must
be satisfied:

$$\frac{d}{dx}\left[\frac{\partial}{\partial y'} F(x, y, y')\right] = \frac{\partial}{\partial y} F(x, y, y'). \qquad (6.10)$$

Applying this to the functional for shortest distance, we have

$$F(x, y, y') = \left(1 + (y')^2\right)^{1/2}, \qquad \frac{\partial F}{\partial y} = 0, \qquad \frac{\partial F}{\partial y'} = \frac{y'}{\sqrt{1 + (y')^2}}.$$

From this, it follows that

$$\frac{d}{dx}\left[\frac{y'}{\sqrt{1 + (y')^2}}\right] = 0, \qquad\text{so}\qquad \frac{y'}{\sqrt{1 + (y')^2}} = c.$$

Solving for y' gives

$$y' = \sqrt{\frac{c^2}{1 - c^2}} = \text{a constant} = b,$$

and, on integrating,

$$y = bx + a.$$

As stated above, y(x) must pass through $P_1$ and $P_2$; this condition is used to evaluate
the constants a and b.
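A small numerical experiment makes the idea of a functional concrete. The sketch below is an assumption-laden illustration, not part of the derivation: it takes the two points to be (0, 0) and (1, 1), uses a hypothetical one-parameter family of trial curves y = x + a x(1 - x) that all pass through both points, and approximates I[y] by summing the chord lengths of a fine polyline.

```python
import math


def path_length(y, x1, x2, pieces=2000):
    """Approximate I[y], the length of the curve y(x), by a fine polyline."""
    total = 0.0
    for k in range(pieces):
        xa = x1 + (x2 - x1) * k / pieces
        xb = x1 + (x2 - x1) * (k + 1) / pieces
        total += math.hypot(xb - xa, y(xb) - y(xa))
    return total


def trial_curve(a):
    """A curve through (0, 0) and (1, 1); a = 0 is the straight line."""
    return lambda x: x + a * x * (1.0 - x)


lengths = {a: path_length(trial_curve(a), 0.0, 1.0)
           for a in (-1.0, -0.5, 0.0, 0.5, 1.0)}
for a, length in lengths.items():
    print(f"a = {a:5.1f}   I[y] = {length:.6f}")
```

Each choice of the function y gives a single number I[y], and within this family the smallest value occurs for a = 0, the straight line, in agreement with y = bx + a.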

Let us advance to a less trivial case. Consider the second-order linear boundary-value
problem over [a, b],

$$y'' + Q(x)\,y = F(x), \qquad y(a) = 0, \quad y(b) = 0. \qquad (6.11)$$

The equation is subject to homogeneous Dirichlet conditions. It turns out that the functional
that corresponds to Eq. (6.11) is

$$I[u] = \int_a^b \left[\left(\frac{du}{dx}\right)^2 - Qu^2 + 2Fu\right] dx. \qquad (6.12)$$


We can transform Eq. (6.12) into Eq. (6.11) through the Euler-Lagrange conditions, 
so optimizing Eq. (6.12) gives the solution to Eq. (6.11). Observe carefully what benefits 
result from operating with the functional rather than the original equation—we have now 
only first-order derivatives instead of second-order ones. This not only simplifies the 
mathematics but permits finding solutions even when there are discontinuities that cause 
y to not have sufficiently high derivatives. (There is another approach that shows the 
equivalence—multiply Eq. (6.11) by the function u and then use integration by parts on 
the first term.) If the differential equation has boundary conditions that involve the deriv- 
ative of y, the functional in Eq. (6.12) must be modified. 

If we know the solution to our differential equation, substituting it for u in Eq. (6.12)
will make I[u] a minimum. When the answer isn't known, perhaps we can approximate
it by some (almost) arbitrary function and see if we can minimize the functional by a
suitable choice of the parameters in the approximation. The Rayleigh-Ritz method is
based on this idea. We assume that u can be approximated by a linear sum of functions:

$$u(x) \approx c_1v_1 + c_2v_2 + \cdots + c_nv_n = \sum_{i=1}^{n} c_iv_i.$$

We substitute u and du/dx from this into Eq. (6.12) to get

$$I(c_1, \ldots, c_n) = \int_a^b \left[\left(\frac{d}{dx}\sum c_iv_i\right)^2 - Q\left(\sum c_iv_i\right)^2 + 2F\sum c_iv_i\right] dx.$$

At this point, observe that I is an ordinary function of the unknown $c_i$'s; our notation
reflects this. To get the optimum, we differentiate partially with respect to each $c_i$ in turn,
setting each derivative to zero. This gives, when the integrals are evaluated, a linear system
in the $c_i$'s that we solve by the usual methods.


EXAMPLE

Solve y'' = 6x - 6, subject to y(0) = 0, y(1) = 0.


The functional is

$$I[u] = \int_0^1 \left[\left(\frac{du}{dx}\right)^2 + 2(6x - 6)u\right] dx.$$

Assume $u = a + bx + cx^2$ (here $v_1 = 1$, $v_2 = x$, $v_3 = x^2$). To satisfy the boundary
conditions, we take a = 0, b = -c, so

$$u = c(x)(x - 1) = c(x^2 - x), \qquad \frac{du}{dx} = c(2x - 1), \qquad \left(\frac{du}{dx}\right)^2 = c^2(2x - 1)^2.$$

The second term of the integrand is

$$2(6x - 6)(u) = 2(6x - 6)(c)(x^2 - x) = 12c(x^3 - 2x^2 + x).$$

Now we have

$$I(c) = \int_0^1 c^2(2x - 1)^2\,dx + \int_0^1 12c(x^3 - 2x^2 + x)\,dx.$$

Differentiate with respect to c and set to zero:

$$\frac{dI}{dc} = 0 = c\int_0^1 (2x - 1)^2(2)\,dx + 12\int_0^1 (x^3 - 2x^2 + x)\,dx.$$

On evaluating the integrals we find that

$$\frac{2}{3}c + 1 = 0, \qquad c = -\frac{3}{2}, \qquad\text{so}\qquad u(x) = \frac{3}{2}\,x(1 - x).$$

Our example is very easy because there is only one unknown coefficient. If there were
N of these, a set of N linear equations in the N coefficients would result from setting the
several partial derivatives of I to zero.

Table 6.9 compares this result to the analytical answer, $y = 2x - 3x^2 + x^3$.
Table 6.9

  x      Exact     u(x) = (3/2)(x)(1 - x)

 0       0          0
 0.2     0.288      0.240
 0.4     0.384      0.360
 0.6     0.336      0.360
 0.8     0.192      0.240
 1.0     0          0


We leave it as an exercise for you to show that using a cubic polynomial for u gives
a perfect match to the analytical answer.
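The one-coefficient calculation is easy to check with exact rational arithmetic. The following sketch assumes the trial function u = c(x² - x) used in the example; the helper name `integrate_01` is hypothetical. Since I(c) = c²·∫(2x - 1)² dx + c·∫12(x³ - 2x² + x) dx, setting dI/dc = 0 gives c directly.

```python
from fractions import Fraction


def integrate_01(coeffs):
    """Exact integral over [0, 1] of sum(coeffs[k] * x**k)."""
    return sum(Fraction(a, k + 1) for k, a in enumerate(coeffs))


quad = integrate_01([1, -4, 4])         # (2x - 1)^2 = 1 - 4x + 4x^2
lin = integrate_01([0, 12, -24, 12])    # 12(x^3 - 2x^2 + x)
c = -lin / (2 * quad)                   # from dI/dc = 2c*quad + lin = 0


def u(x):
    """Rayleigh-Ritz approximation with the single coefficient c."""
    return c * (x * x - x)


def exact(x):
    """Analytical solution of y'' = 6x - 6, y(0) = y(1) = 0."""
    return 2 * x - 3 * x**2 + x**3


for k in range(6):
    x = Fraction(k, 5)
    print(float(x), float(exact(x)), float(u(x)))
```

Working with `Fraction` avoids any round-off, so the printed columns can be compared digit for digit with Table 6.9.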


The difficulty with this classical form of the Rayleigh—Ritz method is that the proper 
form of approximating function is often not easy to find; polynomials are commonly used 
but may be unsuitable. The trouble lies in trying to fit a complex function over the entire 
interval [a, b]. But we have previously seen how we can overcome this problem—even 
low-degree polynomials can approximate any function if applied over a small enough 
interval. We now examine the application of Rayleigh—Ritz to subintervals of [a, b]. 

What we do here is equivalent to interpolating between values of the solution to the
differential equation at distinct points within [a, b]. If we knew values for y at these
points, we could interpolate to get y at points between; if we assume a linear relation,
we can get y in the subinterval $[x_i, x_{i+1}]$ from $y_i$ and $y_{i+1}$ by defining

$$y(x) = N_iy_i + N_{i+1}y_{i+1}, \qquad\text{with}$$

$$N_i(x) = \begin{cases} (x - x_{i-1})/h_{i-1} & \text{on } [x_{i-1}, x_i], \\ (x_{i+1} - x)/h_i & \text{on } [x_i, x_{i+1}], \\ 0 & \text{elsewhere,} \end{cases}$$

where $h_i = |x_{i+1} - x_i|$. In effect, y within the interval is a weighted average of the values
at the ends. We will take h as a constant in the following.


Here is a summary of some of the properties of these weighting functions:

1. $N_i(x_j) = 1$ if i = j, and 0 otherwise (if i ≠ j),

2. $N_i(x) \cdot N_j(x) = 0$ for |i - j| > 1 (intervals other than those that have $x_i$ as
an endpoint),

3. $N_i'(x) \cdot N_j'(x) = 0$ for |i - j| > 1,

4. $N_i(x) \cdot N_{i+1}(x) = (x_{i+1} - x)(x - x_i)/h^2$ on $[x_i, x_{i+1}]$, and 0 elsewhere,

5. $N_i'(x) \cdot N_{i+1}'(x) = -1/h^2$ on $[x_i, x_{i+1}]$, and 0 elsewhere.


This sketch shows the appearance of the $N_i(x)$:

[Sketch: the hat-shaped functions $N_i(x)$, each rising linearly from 0 at $x_{i-1}$ to 1 at $x_i$ and falling back to 0 at $x_{i+1}$.]

We will approximate the solution to the differential equation by substituting into the
functional for u(x):

$$u(x) = \sum_{i=0}^{n} c_iN_i(x).$$

When we do so and minimize by setting derivatives with respect to $c_i$ equal to zero, the
values of the $c_i$'s will be approximations to y at the nodes within [a, b].


EXAMPLE

Solve y'' - (1 - x/5)y = x, y(1) = 2 and y(3) = -1. (This is the same problem as in
Sections 6.1 and 6.2, but the variables have been renamed.)

Subdivide the interval [1, 3] into four equal subintervals, making h = 0.5. The $N_i(x)$
functions look like this:

[Sketch: the functions $N_0$, $N_1$, $N_2$, $N_3$, $N_4$, centered at x = 1.0, 1.5, 2.0, 2.5, 3.0.]

(Observe that $N_0$ and $N_4$ are different from the others. This is required to handle the
nonhomogeneous end conditions.) We approximate the function with

$$y(x) = c_0N_0(x) + c_1N_1(x) + c_2N_2(x) + c_3N_3(x) + c_4N_4(x)$$

($c_0$ will be 2 and $c_4$ will be -1 to satisfy the end conditions, but we postpone this
substitution.) On substituting into the functional, we have

$$\begin{aligned} I(c_0, \ldots, c_4) = \int_1^3 \Big[ &(c_0N_0' + c_1N_1' + c_2N_2' + c_3N_3' + c_4N_4')^2 \\ &+ \left(1 - \tfrac{x}{5}\right)(c_0N_0 + c_1N_1 + c_2N_2 + c_3N_3 + c_4N_4)^2 \\ &+ 2x(c_0N_0 + c_1N_1 + c_2N_2 + c_3N_3 + c_4N_4) \Big]\,dx. \end{aligned}$$

We will set the derivatives of I with respect to $c_i$, i = 1, 2, 3, to zero. We do not do this
for $c_0$ and $c_4$ because they are fixed. We also substitute the known values for $c_0$ and $c_4$.
This gives

$$A\mathbf{c} = \mathbf{b},$$

where

$$A_{ij} = \int_1^3 \left[N_i'N_j' + \left(1 - \frac{x}{5}\right)N_iN_j\right] dx, \qquad b_i = -\int_1^3 xN_i\,dx.$$

Some example terms that you should verify are

$$A(1, 1) = 4 + 4\int_{1.0}^{1.5}\left(1 - \frac{x}{5}\right)(x - 1)^2\,dx + 4\int_{1.5}^{2.0}\left(1 - \frac{x}{5}\right)(2 - x)^2\,dx,$$

$$A(1, 2) = -2 + 4\int_{1.5}^{2.0}\left(1 - \frac{x}{5}\right)(2 - x)(x - 1.5)\,dx,$$

$$b(1) = -2\int_{1.0}^{1.5} x(x - 1)\,dx - 2\int_{1.5}^{2.0} x(2 - x)\,dx.$$
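We used Simpson's method to evaluate the integrals, getting

$$A = \begin{bmatrix} 4.2333 & -1.9458 & 0.0 \\ -1.9458 & 4.1999 & -1.9542 \\ 0.0 & -1.9541 & 4.1667 \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} 3.125 \\ -1.0 \\ -3.2125 \end{bmatrix},$$

which has the solution

$$\mathbf{c} = \begin{bmatrix} 0.5323 \\ -0.4480 \\ -0.9811 \end{bmatrix}.$$

This follows from the fact that we set $c_0 = 2$ and $c_4 = -1$ in the system

$$\begin{bmatrix} -1.9375 & 4.2333 & -1.9458 & 0.0 & 0.0 \\ 0.0 & -1.9458 & 4.1999 & -1.9542 & 0.0 \\ 0.0 & 0.0 & -1.9541 & 4.1667 & -1.9625 \end{bmatrix} \begin{bmatrix} 2.0 \\ c_1 \\ c_2 \\ c_3 \\ -1.0 \end{bmatrix} = \begin{bmatrix} -0.75 \\ -1.0 \\ -1.25 \end{bmatrix}.$$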

Now we have an approximation to the solution of our differential equation:

$$y(x) = 2N_0(x) + 0.5323N_1(x) - 0.4480N_2(x) - 0.9811N_3(x) - N_4(x).$$

The nodal values are quite obvious:

y(1.5) = 0.5323,
y(2.0) = -0.4480,
y(2.5) = -0.9811,

but we can also use the equation for y to compute for any point in [1, 3].
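The assembly of A and b can be organized element by element. The sketch below is one possible arrangement, not the book's program: it loops over the subintervals, adds each subinterval's contribution to $\int [N_i'N_j' + (1 - x/5)N_iN_j]\,dx$ and $-\int xN_i\,dx$ using Simpson's rule (exact here, since the integrands are at most cubic), moves the known end values to the right-hand side, and solves for the interior nodes. All function names are hypothetical.

```python
def simpson(f, a, b):
    """Simpson's rule on a single interval."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f(0.5 * (a + b)) + f(b))


def solve_dense(A, b):
    """Gaussian elimination without pivoting (adequate for this small,
    diagonally dominant system)."""
    n = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n):
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x


def hat_function_solution(n, y_left=2.0, y_right=-1.0):
    """Nodal values for y'' - (1 - x/5) y = x on [1, 3] with n equal elements."""
    h = 2.0 / n
    nodes = [1.0 + i * h for i in range(n + 1)]
    A = [[0.0] * (n + 1) for _ in range(n + 1)]
    b = [0.0] * (n + 1)
    for e in range(n):                        # element [x_e, x_(e+1)]
        xa, xb = nodes[e], nodes[e + 1]
        shape = (lambda x: (xb - x) / h, lambda x: (x - xa) / h)
        slope = (-1.0 / h, 1.0 / h)
        for p in range(2):
            b[e + p] -= simpson(lambda x: x * shape[p](x), xa, xb)
            for q in range(2):
                mass = simpson(
                    lambda x: (1.0 - x / 5.0) * shape[p](x) * shape[q](x), xa, xb)
                A[e + p][e + q] += slope[p] * slope[q] * h + mass
    rows = range(1, n)                        # interior nodes only
    rhs = [b[i] - A[i][0] * y_left - A[i][n] * y_right for i in rows]
    inner = [[A[i][j] for j in rows] for i in rows]
    return nodes, [y_left] + solve_dense(inner, rhs) + [y_right]


nodes, y = hat_function_solution(4)
for xi, yi in zip(nodes, y):
    print(f"{xi:.2f}  {yi:8.4f}")
```

Calling `hat_function_solution(8)` repeats the computation with h = 0.25, the refinement discussed next.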


Table 6.10

           Rayleigh-Ritz method          Runge-Kutta-Fehlberg

  x        y             y               y
           (h = 0.50)    (h = 0.25)      (h = 0.25)

 1.25     (1.2662)*      1.2033          1.2011
 1.50      0.5323        0.5381          0.5400
 1.75     (0.0422)*     -0.0064         -0.0041
 2.00     -0.4480       -0.4408         -0.4385
 2.25    (-0.7146)*     -0.7659         -0.7637
 2.50     -0.9811       -0.9758         -0.9741
 2.75    (-0.9906)*     -1.0593         -1.0583

*Interpolated


We repeated this same application of Rayleigh—Ritz to subintervals of [1, 3] but with 
h = 0.25. This produced nine weighting functions instead of five and allowed the com- 
putation of more accurate values for the solution. Table 6.10 summarizes the results and 
compares them to those from Runge—Kutta—Fehlberg. 

What we have done above is really an application of the finite-element method. 
Basically, this breaks up the region of a boundary-value problem into subintervals and 
applies a variational method within the subintervals. There are alternative ways to derive 
the equations that produce the linear system of equations whose solutions are the approx- 
imations to the solution of the boundary-value problem at the nodes. One of these is the 
Galerkin method, which, for problems of the same type as our example, results in exactly 
the same set of equations. 


6.5 CHARACTERISTIC-VALUE PROBLEMS


Problems in the fields of elasticity and vibration (including applications of the wave 
equations of modern physics) fall into a special class of boundary-value problems known 
as characteristic-value problems. (Certain problems in statistics also reduce to such 
problems.) We discuss only the most elementary forms of characteristic-value problems 
here. 

Consider the homogeneous* second-order equation with homogeneous boundary
conditions:

$$\frac{d^2y}{dx^2} + k^2y = 0, \qquad y(0) = 0, \quad y(1) = 0, \qquad (6.13)$$

*Homogeneous here means that all the terms are alike in being functions of y or its derivatives.

where $k^2$ is a parameter. We first solve this equation nonnumerically to show that there
is a solution for only certain particular, or “characteristic,” values of the parameter. The
general solution is

$$y = a\sin kx + b\cos kx,$$

which can easily be verified by substituting into the differential equation; the solution
contains the two arbitrary constants a and b, because the differential equation is second-order.
The constants a and b are to be determined to make the general solution agree
with the boundary conditions:

At x = 0, y = 0 = a sin(0) + b cos(0) = b. Then b must be zero. At x = 1, y =
0 = a sin(k); we may have either a = 0 or sin(k) = 0 to satisfy this condition. The
former leads to y(x) = 0, which we call the trivial solution. This function, y everywhere
zero, will always be a solution to any homogeneous differential equation with homogeneous
boundary conditions. (The trivial solution is usually of no interest.) To get a
nontrivial solution, we must choose the other alternative, sin(k) = 0, which is satisfied
only for certain values of k, the characteristic values of the system. The solution to Eq.
(6.13) must then be

$$k = \pm n\pi, \quad n = 1, 2, 3, \ldots, \qquad y = a\sin n\pi x. \qquad (6.14)$$


Note that the arbitrary constant a can have any value and still permit the function y 
to meet the boundary conditions, so that the solution is determined only to within a 
multiplicative constant. In Fig. 6.2 we sketch several of the solutions as given by Eq. 
(6.14). 
The information of interest in characteristic-value problems is these characteristic
values, or eigenvalues, for the system. If we are dealing with a vibration problem, these 
give the natural frequencies of the system, which are especially important because, with 
external loads applied at or near these frequencies, resonance will cause an amplification 
of motion so that failure is likely. In the field of elasticity, there is an eigenfunction 
corresponding to each eigenvalue, and these determine the possible shapes of the elastic 
curve when the system is in equilibrium. Often the smallest nonzero value of the parameter 
is especially important; this gives the fundamental frequency of the system. 

To illustrate the use of numerical methods, we will solve Eq. (6.13) again. Our attack
is the previous one of replacing the differential equation by a difference equation, and
writing this at all points into which the x-interval has been subdivided and where y is
unknown.* Replacing the derivative in (6.13) by a central-difference approximation, we
have

$$\frac{y_{i+1} - 2y_i + y_{i-1}}{h^2} + k^2y_i = 0. \qquad (6.15)$$

Letting h = 0.2, and writing out Eq. (6.15) at each of the four interior points, we
get

$$\begin{aligned}
y_1 - (2 - 0.04k^2)y_2 + y_3 &= 0,\\
y_2 - (2 - 0.04k^2)y_3 + y_4 &= 0,\\
y_3 - (2 - 0.04k^2)y_4 + y_5 &= 0,\\
y_4 - (2 - 0.04k^2)y_5 + y_6 &= 0.
\end{aligned} \qquad (6.16)$$

The boundary conditions give $y_1 = y_6 = 0$. Making this substitution and writing in
matrix form, after multiplying all equations by -1, we get

$$\begin{bmatrix} 2 - 0.04k^2 & -1 & 0 & 0 \\ -1 & 2 - 0.04k^2 & -1 & 0 \\ 0 & -1 & 2 - 0.04k^2 & -1 \\ 0 & 0 & -1 & 2 - 0.04k^2 \end{bmatrix} \begin{bmatrix} y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix} = 0. \qquad (6.17)$$

Note that we can write this as the matrix equation $(A - \lambda I)y = 0$, where

$$A = \begin{bmatrix} 2 & -1 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 2 \end{bmatrix}, \qquad \lambda = 0.04k^2,$$

and I is the identity matrix of order 4. We will consider the problem in this form in the
next section.


*There is an alternative trial-and-error approach, analogous to the shooting method, but it generally involves 
more work. 
The solution to our characteristic-value problem, Eq. (6.13), reduces to solving the
set of equations in (6.16) or (6.17).

Such a set of homogeneous linear equations has a nontrivial solution only if the
determinant of the coefficient matrix is equal to zero. Hence

$$\det\begin{bmatrix} 2 - 0.04k^2 & -1 & 0 & 0 \\ -1 & 2 - 0.04k^2 & -1 & 0 \\ 0 & -1 & 2 - 0.04k^2 & -1 \\ 0 & 0 & -1 & 2 - 0.04k^2 \end{bmatrix} = 0.$$

Expanding this determinant will give an eighth-degree polynomial in k (this is called
the characteristic polynomial of matrix A); the roots of this polynomial will be approximations
to the characteristic values of the system. Letting $2 - 0.04k^2 = Z$ makes the
expansion simpler, and we get

$$Z^4 - 3Z^2 + 1 = 0.$$

The roots of this biquadratic are given by $Z^2 = (3 \pm \sqrt{5})/2$, or

Z = 1.618, -1.618, 0.618, -0.618.

We will get an estimate of the principal eigenvalues from $2 - 0.04k^2 = 1.618$,
giving k = ±3.09 (compare to ±π). The next eigenvalues are obtained from

$2 - 0.04k^2 = 0.618$, giving k = ±5.88 (compare to ±2π),
$2 - 0.04k^2 = -0.618$, giving k = ±8.09 (compare to ±3π),
$2 - 0.04k^2 = -1.618$, giving k = ±9.51 (compare to ±4π).

Our estimates of the characteristic values get progressively worse. Fortunately the 
smallest values of k are of principal interest in many applied problems. To improve 
accuracy, we will need to write our difference equation using smaller values of h. The 
work soon becomes inordinately long, and we look for a simpler method to find the 
eigenvalues of a matrix. This is the subject of the next section. 

Another reason, beyond the amount of computational effort, that argues against 
expanding the determinant and then finding the roots of the characteristic equation is 
accuracy. Finding the zeros of a polynomial of high degree is a process that is often 
subject to large errors due to round-off. The process is very ill-conditioned. 
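For this small case the arithmetic is quickly checked. The sketch below simply evaluates the roots of the biquadratic $Z^4 - 3Z^2 + 1 = 0$, converts each to k through $2 - 0.04k^2 = Z$ with h = 0.2, and prints the exact values nπ alongside; the variable names are illustrative.

```python
import math

h = 0.2
# Z^2 = (3 +/- sqrt(5)) / 2, so there are four real roots Z
z_squared = [(3 + math.sqrt(5)) / 2, (3 - math.sqrt(5)) / 2]
z_roots = sorted((s * math.sqrt(z2) for z2 in z_squared for s in (1, -1)),
                 reverse=True)

# 2 - h^2 k^2 = Z  gives  k = sqrt((2 - Z) / h^2)
k_estimates = [math.sqrt((2 - z) / h**2) for z in z_roots]

for n, k in enumerate(k_estimates, start=1):
    print(f"n = {n}: k = {k:.2f}   exact n*pi = {n * math.pi:.2f}")
```

The printed pairs show the first estimate within about 2% of π while the fourth misses 4π by roughly a quarter, which is the progressive loss of accuracy described above.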


6.6 EIGENVALUES OF A MATRIX BY ITERATION


We have seen that particular values of the parameter occurring in a characteristic-value
problem are of special interest, and the numerical solution involves finding the eigenvalues
of the coefficient matrix of the set of difference equations. To do this, we need to find
the values of λ that satisfy the matrix equation

$$Ax = \lambda x,$$

where A is a square matrix. In effect, we look for a certain vector x that gives a scalar
multiple of itself when multiplied by the matrix A. For most vectors chosen at random,
this will not be true. If A is n × n and nonsingular, there are usually exactly n different
vectors x which, when multiplied by the matrix A, give a scalar multiple of x. These
vectors are called eigenvectors or characteristic vectors. The multipliers λ, which are n
in number, are termed the eigenvalues or characteristic values of the matrix A. As we
have seen in the previous section, one method is to solve the equivalent matrix equation

$$(A - \lambda I)x = 0$$

by determining the values of λ that make the determinant of (A - λI) equal to zero.
This involves expanding the determinant to give a polynomial in λ, the characteristic
equation, whose roots must be evaluated. This tells us, of course, that there will be n
eigenvalues for the system, but they will not necessarily all be different in value. In fact,
it is possible that some eigenvalues will be complex-valued. Solving for the roots of the
characteristic equation is laborious.* This section develops an alternative iterative procedure
that involves less effort.
Consider a simple 3 × 3 system:

$$\begin{bmatrix} 10 & 0 & 0 \\ 1 & -3 & -7 \\ 0 & 2 & 6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \lambda \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad \begin{bmatrix} 10 - \lambda & 0 & 0 \\ 1 & -3 - \lambda & -7 \\ 0 & 2 & 6 - \lambda \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}. \qquad (6.18)$$

We wish to find the values of λ and vectors x that satisfy Eq. (6.18). Let us first see
what they are.
The characteristic equation is

$$(10 - \lambda)(-3 - \lambda)(6 - \lambda) + 14(10 - \lambda) = 0, \qquad (10 - \lambda)(\lambda - 4)(\lambda + 1) = 0.$$

The roots are

$$\lambda_1 = 10, \qquad \lambda_2 = 4, \qquad\text{and}\qquad \lambda_3 = -1.$$

Substituting each root in turn into Eq. (6.18) gives

$\lambda_1 = 10$:   $0x_1 = 0$,    $x_1 - 13x_2 - 7x_3 = 0$,   $2x_2 - 4x_3 = 0$;
$\lambda_2 = 4$:    $6x_1 = 0$,    $x_1 - 7x_2 - 7x_3 = 0$,    $2x_2 + 2x_3 = 0$;
$\lambda_3 = -1$:   $11x_1 = 0$,   $x_1 - 2x_2 - 7x_3 = 0$,    $2x_2 + 7x_3 = 0$.

We see that any nonzero value can be taken for $x_1$ in the first vector and for $x_2$ in the
second and third. Our eigenvectors are then any multiple of

$$\begin{bmatrix} 1 \\ 2/33 \\ 1/33 \end{bmatrix}, \qquad \begin{bmatrix} 0 \\ -1 \\ 1 \end{bmatrix}, \qquad \begin{bmatrix} 0 \\ 1 \\ -2/7 \end{bmatrix}.$$


*The labor involved is not only in finding the roots of the characteristic polynomial but in evaluating the
coefficients in the first place.

In the matrix of Eq. (6.18) the sum of the eigenvalues 10, 4, —1 is equal to the sum 
of the diagonal elements of the given matrix. This is no coincidence; in fact, for any 
matrix A, the sum of its eigenvalues is equal to the trace of A, tr(A), that was defined 
in Section 2.2. 

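In Chapter 5 we considered initial-value differential equations. The eigenvalues and
eigenvectors of a matrix are important in the solution of a system of differential equations
with constant coefficients. Consider the system

$$x' = 10x, \qquad y' = x - 3y - 7z, \qquad z' = 2y + 6z,$$

where x(0) = 1, y(0) = -1, z(0) = 2. We may rewrite these equations as

$$X'(t) = \begin{bmatrix} 10 & 0 & 0 \\ 1 & -3 & -7 \\ 0 & 2 & 6 \end{bmatrix} X(t), \qquad X(0) = \begin{bmatrix} 1 \\ -1 \\ 2 \end{bmatrix}.$$

The general solution to this problem is

$$X(t) = Ae^{10t}\begin{bmatrix} 1 \\ 2/33 \\ 1/33 \end{bmatrix} + Be^{4t}\begin{bmatrix} 0 \\ -1 \\ 1 \end{bmatrix} + Ce^{-t}\begin{bmatrix} 0 \\ 1 \\ -2/7 \end{bmatrix};$$

that is,

$$x(t) = Ae^{10t}, \qquad y(t) = \tfrac{2}{33}Ae^{10t} - Be^{4t} + Ce^{-t}, \qquad z(t) = \tfrac{1}{33}Ae^{10t} + Be^{4t} - \tfrac{2}{7}Ce^{-t},$$

where

$$1 = A, \qquad -1 = \tfrac{2}{33}A - B + C, \qquad 2 = \tfrac{1}{33}A + B - \tfrac{2}{7}C \qquad\text{at } t = 0.$$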
Unfortunately, we cannot find the eigenvalues of a general matrix by simply reducing 
it to triangular form by Gaussian elimination, as one at first might hope. The reason is 
that row reduction changes the eigenvalues. For example, consider the matrices A and 


B: a=[7 3]. e=[5 3} 


Matrix B is derived from A by row reduction: it has eigenvalues 2 and 3 while the
eigenvalues of A are 3 and 1.

We now illustrate an iterative procedure for determining the eigenvalue of largest
magnitude of Eq. (6.18). This is called the power method. We start with an arbitrary
vector and multiply it by the matrix repeatedly. Using

$$[1,\ 0,\ 0]^T,$$

we have the results below. At each step, we normalize the vector by making its largest
component equal to unity.

$$\begin{aligned}
A\,[1,\ 0,\ 0]^T &= [10,\ 1,\ 0]^T = 10\,[1,\ 0.1,\ 0]^T,\\
A\,[1,\ 0.1,\ 0]^T &= [10,\ 0.7,\ 0.2]^T = 10\,[1,\ 0.07,\ 0.02]^T,\\
A\,[1,\ 0.07,\ 0.02]^T &= [10,\ 0.65,\ 0.26]^T = 10\,[1,\ 0.065,\ 0.026]^T,\\
A\,[1,\ 0.065,\ 0.026]^T &= [10,\ 0.623,\ 0.286]^T = 10\,[1,\ 0.0623,\ 0.0286]^T,\\
A\,[1,\ 0.0623,\ 0.0286]^T &= [10,\ 0.6129,\ 0.2962]^T = 10\,[1,\ 0.06129,\ 0.02962]^T.
\end{aligned}$$

We see that the successive values of the product vector approach closer and closer
to multiples of the eigenvector

$$[1,\ 0.0606,\ 0.0303]^T.$$

The normalization factors approach the value of the largest eigenvalue.
Before leaving this example, we again observe that the trace equals the sum of the eigenvalues:

$$\operatorname{tr}\begin{bmatrix} 10 & 0 & 0 \\ 1 & -3 & -7 \\ 0 & 2 & 6 \end{bmatrix} = 13 = 10 + 4 - 1.$$

In programming, one should use this fact to keep a check on the accuracy of the eigenvalues
found.

The method is slow to converge if the magnitudes of the largest eigenvalues are 
nearly the same. 

With more realistically sized matrices, the iteration takes many steps and requires a 
computer for its practical execution. The method will converge only when the eigenvalue 
largest in modulus is uniquely determined. 

The power method works because the eigenvectors are a set of basis vectors. By
this we mean that they are a set of linearly independent vectors that span the space; that
is, any n-component vector can be written as a unique linear combination of them. Let
$v^{(0)}$ be any vector and $x_1, x_2, \ldots, x_n$ be eigenvectors. Then

$$v^{(0)} = c_1x_1 + c_2x_2 + \cdots + c_nx_n.$$

If we multiply $v^{(0)}$ by the matrix A, we have (since the $x_i$ are eigenvectors with
corresponding eigenvalues $\lambda_i$),

$$\begin{aligned}
v^{(1)} = Av^{(0)} &= c_1Ax_1 + c_2Ax_2 + \cdots + c_nAx_n \\
&= c_1\lambda_1x_1 + c_2\lambda_2x_2 + \cdots + c_n\lambda_nx_n. \qquad (6.19)
\end{aligned}$$

Upon repeated multiplication by A we get, after m times,

$$v^{(m)} = A^mv^{(0)} = c_1\lambda_1^mx_1 + c_2\lambda_2^mx_2 + \cdots + c_n\lambda_n^mx_n. \qquad (6.20)$$

If one eigenvalue, say $\lambda_1$, is larger in magnitude than all the rest, the values of $\lambda_i^m$,
i ≠ 1, will be negligibly small in comparison to $\lambda_1^m$ when m is large and

$$A^mv \to c_1\lambda_1^mx_1, \qquad\text{or}\qquad A^mv \to \text{multiple of eigenvector } x_1,$$

with the normalization factor of $\lambda_1$, provided that $c_1 \ne 0$. This is the principle behind
the power method.
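The procedure translates almost directly into code. The following is a minimal sketch rather than a production routine: it runs a fixed number of steps instead of testing for convergence, normalizes by the component of largest magnitude, and uses the matrix of Eq. (6.18) with a hypothetical starting vector [1, 0, 0].

```python
def power_method(matrix, start, steps=50):
    """Repeatedly multiply by the matrix, normalizing so that the
    component of largest magnitude equals one."""
    v = list(start)
    factor = 0.0
    for _ in range(steps):
        w = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        factor = max(w, key=abs)          # normalization factor
        v = [x / factor for x in w]
    return factor, v


A = [[10.0, 0.0, 0.0],
     [1.0, -3.0, -7.0],
     [0.0, 2.0, 6.0]]

eigenvalue, eigenvector = power_method(A, [1.0, 0.0, 0.0])
print(eigenvalue, eigenvector)
```

Because the next eigenvalue in magnitude is 4, the unwanted components shrink by a factor of about 0.4 per step, so the vector settles toward [1, 2/33, 1/33] fairly quickly; a ratio close to 1 would make the same loop converge very slowly.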
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In applying Eq. (6.20) to find the dominant eigenvalue (the eigenvalue whose mag- 
nitude exceeds all others), we normalize at each step; this in effect is just scaling the c’s 
after each multiplication, and will not change the relation. If some other eigenvalue than 
A, were the dominant one, we would merely renumber the eigenvalues. It is easy to see 
why the rate of convergence depends strongly on the ratio of the two eigenvalues of 
largest magnitude. 

Obviously, if we are able to choose the starting vector $v^{(0)}$ close to the eigenvector 
$x_1$, we will converge more rapidly; the coefficients $c_i$, $i \neq 1$, in Eq. (6.19) will be small 
in this case. We usually have no knowledge of this eigenvector, so we start with an 
arbitrary vector with all components equal to unity. This choice could be a bad one. If 
the value of $c_1$ in Eq. (6.19) is zero for this arbitrary choice, the process may never 
converge. Actually, the process often does converge in spite of this; round-off errors 
incurred as we repeatedly multiply by $A$ produce components in the direction of $x_1$ so 
that we do converge, although slowly. 
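
The normalized iteration described above can be sketched in a few lines of Python. This is only a minimal illustration, not the book's Program 2: the names `mat_vec` and `power_method`, the tolerance, and the small 2 x 2 test matrix (whose eigenvalues are 3 and 1) are assumptions made for the example.

```python
def mat_vec(a, x):
    """Return the product of matrix a (a list of rows) and vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def power_method(a, x0, tol=1e-10, max_iter=1000):
    """Estimate the dominant eigenvalue of a and its eigenvector.

    Each product A*x is scaled by its element of largest magnitude;
    that scale factor is the running estimate of the eigenvalue.
    """
    x = list(x0)
    previous = 0.0
    for iteration in range(1, max_iter + 1):
        y = mat_vec(a, x)
        factor = max(y, key=abs)          # normalization factor
        if factor == 0.0:
            raise ValueError("A*x is the zero vector; choose another start")
        x = [yi / factor for yi in y]
        if abs(factor - previous) <= tol:
            return factor, x, iteration
        previous = factor
    raise RuntimeError("power method did not converge")


# Hypothetical test matrix with eigenvalues 3 and 1.
A = [[2.0, 1.0],
     [1.0, 2.0]]
eigenvalue, eigenvector, iterations = power_method(A, [1.0, 0.0])
print(eigenvalue, eigenvector, iterations)
```

The starting vector (1, 0) has a nonzero component along the dominant eigenvector (1, 1), so $c_1 \neq 0$ and the normalization factors settle toward 3.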

A special advantage of the power method is that the eigenvector that corresponds to 
the dominant eigenvalue is generated at the same time. For most methods of determining 
eigenvalues, a separate computation is needed to obtain the eigenvector. A disadvantage, 
of course, is that it gives only one eigenvalue. Sometimes the largest eigenvalue is the 
most important, but if it is not, we must modify the method. 

In some problems, the most important eigenvalue and eigenvector is that where the 
eigenvalue is of least magnitude. We can find this one if we apply the power method to 
the inverse of A. This is because the inverse matrix has a set of eigenvalues that are the 
reciprocals of the eigenvalues of A. This is readily shown: 


Given that $Ax = \lambda x$. 
Multiply by $A^{-1}$: 

\[ A^{-1}Ax = A^{-1}\lambda x = \lambda A^{-1}x. \]

From this we see that 

\[ A^{-1}x = \frac{1}{\lambda}\, x. \]


(It is inefficient to actually invert $A$ before applying the power method to get the 
eigenvalue. We use the LU equivalent of $A$, obtainable at much less effort, and solve for 
$v^{(n+1)}$ from 

\[ LUv^{(n+1)} = Av^{(n+1)} = v^{(n)}, \quad \text{equivalent to} \quad A^{-1}v^{(n)} = v^{(n+1)}. \qquad (6.21) \]

The solution of Eq. (6.21) when the LU is known requires the same amount of effort as 
a matrix multiplication.) 

The power method is apt to be slowly convergent. We can accelerate it if we know 
an approximate value for the eigenvalue. This is based on the result of shifting the 
eigenvalues. Observe that subtracting a constant from the diagonal elements of A gives 
a system whose eigenvalues are those of A with the same constant subtracted: 


Given $Ax = \lambda x$. 
Subtract $sIx = sx$ from both sides: 

\[ Ax - sIx = \lambda x - sx, \]
\[ (A - sI)x = (\lambda - s)x. \]

The above relationship can be applied in two ways. Suppose we wish to determine the 
value of an eigenvalue near to some number s. We shift the eigenvalues by subtracting 
s from the diagonal elements; there is then an eigenvalue very near to zero in the shifted 
matrix. We use the power method on the inverse of the shifted matrix. This is often 
rapidly convergent because the reciprocal of the very small value is very large, and is 
usually much larger than the next largest one (for the shifted inverse system). After we 
obtain it, we reverse the transformations to obtain the desired value for the original matrix. 
This process is called the inverse power method. It is illustrated in the example below. 
Another application of shifting is to determine the eigenvalue at the other extreme 
after the dominant one has been computed. Suppose a matrix has eigenvalues whose 
values are 8, 4, 1, and —5. After applying the power method to get the one at 8, we 
shift by subtracting 8. The eigenvalues of the shifted system are then 0, —4, —7, and 
—13. The power method will get the dominant one, at —13. When we add 8 back to 
reverse the effect of the shifting, we obtain —5, the eigenvalue of the original matrix at 
the opposite extreme from 8. 
We illustrate the use of the power method and its several variations with the 3 x 3 
matrix: 

\[ A = \begin{bmatrix} 4 & -1 & 1 \\ 1 & 1 & 1 \\ -2 & 0 & -6 \end{bmatrix}. \]


If we begin with the initial vector $(1, 1, 1)^T$, the regular power method gives the results 
shown in Table 6.11. The iterations were terminated when the normalization factor 
changed by less than 0.0001 from the previous value. 

After the dominant eigenvalue at —5.76849 was determined, the matrix was shifted 
by subtracting this value from the elements on the diagonal. For this shifted matrix, the 
power method gave, after 32 iterations, a normalization factor of 9.23759 with a vector 
of $(1, 0.31963, -0.21121)^T$. We then know that the original matrix had an eigenvalue 
at 3.4691 (from 9.23759 + (—5.76849)). (The corresponding eigenvector is the same as 
that determined by the power method for the shifted matrix.) 


Table 6.11 

Iteration    Normalization    Resultant 
number       factor           vector 
    1          -8             (-0.5, -0.375, 1)^T 
    2          -5             (0.125, -0.025, 1)^T 
    3          -6.25          (-0.244, -0.176, 1)^T 
    .            .                  . 
   22          -5.76849       (-0.1157, -0.1306, 1)^T 
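
The two computations just described (the regular power method, then the power method on the shifted matrix) can be reproduced with a short Python sketch. It assumes the 3 x 3 matrix has rows (4, -1, 1), (1, 1, 1), (-2, 0, -6), which is how the example reads when checked against the tabulated iterates; the function names and the iteration limit are illustrative choices, and the stopping rule mimics the 0.0001 criterion quoted in the text.

```python
def mat_vec(a, x):
    """Return the product of matrix a (a list of rows) and vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def power_method(a, x0, tol=1e-4, max_iter=500):
    """Return (normalization factor, vector, iterations) at convergence."""
    x = list(x0)
    previous = 0.0
    for iteration in range(1, max_iter + 1):
        y = mat_vec(a, x)
        factor = max(y, key=abs)
        x = [yi / factor for yi in y]
        if abs(factor - previous) <= tol:
            return factor, x, iteration
        previous = factor
    raise RuntimeError("power method did not converge")


# Example matrix (entries as reconstructed from the tabulated iterates).
A = [[4.0, -1.0, 1.0],
     [1.0, 1.0, 1.0],
     [-2.0, 0.0, -6.0]]
start = [1.0, 1.0, 1.0]

# Regular power method: dominant eigenvalue, near -5.768.
dominant, vector, count = power_method(A, start)

# Shift: subtract the dominant eigenvalue from the diagonal, iterate again,
# then add the shift back to recover an eigenvalue of the original matrix.
shifted = [[A[i][j] - (dominant if i == j else 0.0) for j in range(3)]
           for i in range(3)]
factor, shifted_vector, shifted_count = power_method(shifted, start)
other_extreme = factor + dominant

print(dominant, vector, count)
print(factor, other_extreme, shifted_vector, shifted_count)
```

Iteration counts depend on the arithmetic used, so they need not match the counts quoted in the text exactly.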
Applying the power method to $A^{-1}$ (through application of the LU equivalent, of 
course) gave, after nine iterations, the value of 0.76989 with a vector of $(0.4121, 1, 
-0.1129)^T$. Now we know that the original matrix $A$ had its smallest eigenvalue at 
1/0.76989 = 1.29923. For this 3 x 3 system, we have computed all its eigenvalues. 

The dominant eigenvalue came with some slowness; 22 iterations were needed. If 
we shift by —5.77 and apply the method to the inverse, only four iterations are required 
to obtain a value of 672.438 and a vector of (—0.1157, —0.1306, 1). This is the same 
vector as that obtained in the first application of the method. We get the dominant 
eigenvalue for A by adding —5.77 to the reciprocal of 672.438: 


\[ \frac{1}{672.438} + (-5.77) = -5.76851. \]


Many iterations have been saved. 
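
A sketch of the inverse power method follows, under stated assumptions: the matrix is the 3 x 3 example with rows (4, -1, 1), (1, 1, 1), (-2, 0, -6) as reconstructed from the tabulated iterates, and the helper names `lu_factor`, `lu_solve`, and `inverse_power` are invented for this illustration. The shifted matrix is factored once, and each iteration solves a system with the stored factors instead of forming an inverse, as Eq. (6.21) suggests.

```python
def lu_factor(a):
    """LU factorization with partial pivoting; returns (lu, perm)."""
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]                 # multiplier, stored in L
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve A y = b using the factors produced by lu_factor."""
    n = len(lu)
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):                           # forward substitution
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    for i in range(n - 1, -1, -1):               # back substitution
        for j in range(i + 1, n):
            y[i] -= lu[i][j] * y[j]
        y[i] /= lu[i][i]
    return y


def inverse_power(a, shift, x0, tol=1e-6, max_iter=100):
    """Return (eigenvalue of a nearest shift, vector, iterations)."""
    n = len(a)
    shifted = [[a[i][j] - (shift if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    lu, perm = lu_factor(shifted)
    x = list(x0)
    previous = 0.0
    for iteration in range(1, max_iter + 1):
        y = lu_solve(lu, perm, x)
        factor = max(y, key=abs)     # estimates 1 / (eigenvalue - shift)
        x = [yi / factor for yi in y]
        if abs(factor - previous) <= tol:
            return shift + 1.0 / factor, x, iteration
        previous = factor
    raise RuntimeError("inverse power method did not converge")


A = [[4.0, -1.0, 1.0],
     [1.0, 1.0, 1.0],
     [-2.0, 0.0, -6.0]]
start = [1.0, 1.0, 1.0]

smallest, v_small, n_small = inverse_power(A, 0.0, start)   # no shift
dominant, v_dom, n_dom = inverse_power(A, -5.77, start)     # shift of -5.77
print(smallest, n_small)
print(dominant, n_dom)
```

With a shift of zero this is the plain power method applied to $A^{-1}$; with the shift near the dominant eigenvalue, very few iterations should be needed, in line with the savings noted above.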

The key to the inverse power method is finding an approximate value for the desired 
eigenvalue. Sometimes the physical problem itself can suggest a value. Applying the 
standard power method for only a few iterations frequently gives an approximate value 
(the many successive iterations are needed to refine the value). Gerschgorin’s theorems 
can give good estimates if there is strong diagonal dominance. Let $D_i$ be a circle whose 
center is $a_{ii}$ and whose radius is 

\[ \sum_{j} |a_{ij}|, \qquad j = 1, 2, \ldots, n \text{ and } j \neq i. \]


Then Gerschgorin I says that every eigenvalue of $A$ must lie in the union of those circles. 
Gerschgorin II says that, if $k$ of these circles do not touch the other $n - k$ circles, then 
exactly $k$ eigenvalues (counting multiplicities) lie in the union of those $k$ circles. For the 
previous example we have 

[Figure: three circles of radius 2, $D_3$, $D_2$, and $D_1$, centered at (-6, 0), (1, 0), and (4, 0).] 

The dominant eigenvalue, -5.76851, was found to lie in $D_3$ and the other two are in 
$D_1 \cup D_2$. 
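
The circles are easy to compute. The following sketch assumes the example matrix has rows (4, -1, 1), (1, 1, 1), (-2, 0, -6); the function names are hypothetical, and the isolation test is written for real centers only.

```python
def gerschgorin_discs(a):
    """Return a list of (center, radius) pairs, one per row of a."""
    discs = []
    for i, row in enumerate(a):
        radius = sum(abs(value) for j, value in enumerate(row) if j != i)
        discs.append((row[i], radius))
    return discs


def isolated_discs(discs):
    """Flag each disc that does not touch any other disc (real centers)."""
    flags = []
    for i, (ci, ri) in enumerate(discs):
        flags.append(all(abs(ci - cj) > ri + rj
                         for j, (cj, rj) in enumerate(discs) if j != i))
    return flags


A = [[4.0, -1.0, 1.0],
     [1.0, 1.0, 1.0],
     [-2.0, 0.0, -6.0]]
discs = gerschgorin_discs(A)
isolated = isolated_discs(discs)
print(discs)      # centers 4, 1, -6, each with radius 2
print(isolated)   # only the disc centered at -6 stands alone
```

By Gerschgorin II the isolated disc contains exactly one eigenvalue, so any number in [-8, -4] is a reasonable first shift when looking for the dominant eigenvalue of this matrix.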

When a matrix has two eigenvalues of equal largest magnitude,* the power method 
as described will fail. In the case of two largest eigenvalues of opposite sign, the 
normalization factor does not converge but oscillates. For example, when the power method 
is applied to 

\[ A = \begin{bmatrix} -10 & -7 & 7 \\ 5 & 5 & -4 \\ -7 & -5 & 6 \end{bmatrix}, \]

which has eigenvalues of 4, -4, and 1, the power method gives the results shown in 
Table 6.12 and the normalizations continue indefinitely to be -10 and -1.6. In such a 
case, the magnitude of the largest eigenvalue(s) is given by taking the square root of the 
product of the alternating normalization factors; in this example, -10 and -1.6 have a 
product, 16, whose square roots are 4 and -4. 

*This may occur if the eigenvalues are complex; for a matrix with real coefficients, complex eigenvalues will 
come in complex conjugate pairs. 
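
The oscillation can be observed directly. In this sketch the matrix rows (-10, -7, 7), (5, 5, -4), (-7, -5, 6) are a reconstruction that is consistent with the stated eigenvalues and with the tabulated iterates; the variable names are illustrative.

```python
def mat_vec(a, x):
    """Return the product of matrix a (a list of rows) and vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


A = [[-10.0, -7.0, 7.0],
     [5.0, 5.0, -4.0],
     [-7.0, -5.0, 6.0]]

x = [1.0, 1.0, 1.0]
factors = []
for _ in range(6):
    y = mat_vec(A, x)
    factor = max(y, key=abs)         # element of largest magnitude
    x = [yi / factor for yi in y]
    factors.append(factor)

# The factors alternate instead of converging; the square root of the
# product of two successive factors gives the common magnitude.
magnitude = (factors[-1] * factors[-2]) ** 0.5
print(factors)
print(magnitude)
```

A practical routine could watch for this two-cycle in the normalization factors and report the magnitude rather than looping until the iteration limit is reached.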

In case all the eigenvalues of a large system are required, the QR algorithm is the 
usual method. This method is based on several properties of eigenvalues that we already 
know. For instance, let $Ax = \lambda x$ where $\lambda$ is an eigenvalue of $A$ and $x$ an eigenvector 
corresponding to $\lambda$. From this it is easy to show that if $A$ is nonsingular, then $\lambda^{-1}$ is an 
eigenvalue of $A^{-1}$. Similarly, we have seen that for this same matrix $A$ we have 
$(\lambda - \alpha)$ as an eigenvalue of $(A - \alpha I)$ and $(\lambda - \alpha)^{-1}$ as an eigenvalue of $(A - \alpha I)^{-1}$. 
In these last two matrices we just subtract $\alpha$ from the diagonal elements of $A$. There is 
only one new property to add to this list, namely, that for any nonsingular matrix $B$: 

\[ BAB^{-1}y = \lambda y, \qquad \text{where } y = Bx. \]

This follows from the property of determinants that says that $\det(A - \lambda I) = 
\det(BAB^{-1} - \lambda I)$. From this we can show that all the eigenvalues of $A$ and $BAB^{-1}$ are 
the same. This property is used again and again in the QR method. In fact, two matrices 
$A$ and $C$ are said to be similar if we have $BAB^{-1} = C$. 

There is an important class of matrices called the Hessenberg matrices. They are of 
the form $H(i, j) = 0$ if $i > (j + 1)$. A few examples will clarify this definition. 


7 8 -8 3 -3 

4 3 1 6 9 -2 -1 29 
m- [3 3 ‘| and H,=|0 9 5 9 1|. 

0 3 -9 OVO s Be Hie, 5 

0 0 0 -9 21 


We can describe the Hessenberg matrix as “almost triangular.” The QR method takes a 
matrix and makes it into a triangular one by similarity transformations so that the eigen- 
values of the original matrix are maintained. The details of the algorithm can be found 
in Stewart (1973). However, we will give a simple description of the method along with 
an example. Program 3 of this chapter implements the QR method for finding eigenvalues 
of a matrix. 

Suppose we start with a matrix $A$, which we denote as $A_0$. 


Table 6.12 

Iteration    Normalization    Resultant 
number       factor           vector 
    1         -10             (1, -0.6, 0.6)^T 
    2          -1.6           (1, 0.25, 0.25)^T 
    3         -10             (1, -0.525, 0.675)^T 
    4          -1.6           (1, 0.2031, 0.2031)^T 
Step 1. Let $A_1 = PA_0P^{-1}$ where $A_1$ is a Hessenberg matrix. The matrix $P$ is 
just the product of elementary matrices, called Householder reflectors. These 
reflectors have a very useful property in that, if $B$ is such a reflector, then 
$B = B^T = B^{-1}$. These matrices can be generated easily. Take any nonzero vector $x$; 
then $B = I - \alpha xx^T$ is a reflector, provided $\alpha = 2/(x^Tx)$. (Note: $xx^T$ is an “outer 
product” since $x$ is $n \times 1$, and $xx^T$ is $n \times n$.) 

For instance, let $A_0$ be a 4 x 4 matrix, 

\[ A_0 = \begin{bmatrix} * & * & * & * \\ * & * & * & * \\ * & * & * & * \\ * & * & * & * \end{bmatrix}; \]

then reflectors $B_1$, $B_2$ can be found so that 

\[ B_1A_0B_1 = \begin{bmatrix} * & * & * & * \\ * & * & * & * \\ 0 & * & * & * \\ 0 & * & * & * \end{bmatrix}
\quad \text{and} \quad
B_2B_1A_0B_1B_2 = \begin{bmatrix} * & * & * & * \\ * & * & * & * \\ 0 & * & * & * \\ 0 & 0 & * & * \end{bmatrix}. \]

Then let $P = B_2B_1$. 

This first step produces a matrix that is almost upper-triangular. 
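
The reflector property $B = B^T = B^{-1}$ can be checked numerically. The sketch below is illustrative only: the function names are invented, and the column (1, 1, 3) is simply a convenient test vector. The vector $x$ is chosen as the column plus its norm (with matching sign) added to the first component, a standard choice that sends the column to a multiple of the first unit vector.

```python
import math


def householder(x):
    """Return B = I - alpha * x x^T with alpha = 2 / (x^T x)."""
    xtx = sum(value * value for value in x)
    if xtx == 0.0:
        raise ValueError("x must be a nonzero vector")
    alpha = 2.0 / xtx
    n = len(x)
    return [[(1.0 if i == j else 0.0) - alpha * x[i] * x[j]
             for j in range(n)] for i in range(n)]


def mat_mul(a, b):
    """Return the matrix product a * b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_vec(a, x):
    """Return the product of matrix a and vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


column = [1.0, 1.0, 3.0]                       # illustrative test column
norm = math.sqrt(sum(value * value for value in column))
x = column[:]
x[0] += math.copysign(norm, column[0])         # x = column + sign * norm * e1

B = householder(x)
reflected = mat_vec(B, column)                 # all but the first entry vanish
product = mat_mul(B, B)                        # should be the identity

print(reflected)
print(product)
```

Applying such a reflector on both sides of a matrix, one column at a time, is what produces the zeros below the subdiagonal in Step 1.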


Step 2. Now we perform an iterative procedure on the matrix from step 1 as it 
is transformed into one with more zeros below the main diagonal until it becomes 
upper-triangular. Here we have for $i = 1, 2, 3, \ldots$: 

$A_i = Q_iR_i$, where $Q_i^TQ_i = I$ and $R_i$ is upper-triangular. Then set 
$A_{i+1} = R_iQ_i$. It is easy to verify that $A_i$, $A_{i+1}$ are similar, since 
$A_{i+1} = Q_i^TA_iQ_i$. Once all the elements below the main diagonal are zero 
(less than TOL in absolute value) we stop. Actually the process is speeded up 
by shifting the matrix as we go along. By modifying the preceding formulation, 
we have 

\[ A_i - \alpha_i I = Q_iR_i, \]
\[ A_{i+1} = R_iQ_i + \alpha_i I. \]

In this case $A_i$, $A_{i+1}$ are still similar and so have the same eigenvalues. 
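
The basic iteration of Step 2 can be sketched as follows. This is a simplified illustration, not the algorithm of Program 3: it omits the Hessenberg reduction and the shifts, and it factors each iterate by modified Gram-Schmidt rather than by rotations. The function names are assumptions; the 4 x 4 test matrix is taken column by column from the DATA statement of Program 3.

```python
def transpose(a):
    """Return the transpose of matrix a."""
    return [list(col) for col in zip(*a)]


def mat_mul(a, b):
    """Return the matrix product a * b."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def qr_decompose(a):
    """Factor a nonsingular matrix as Q R by modified Gram-Schmidt."""
    n = len(a)
    columns = transpose(a)
    q_columns = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = columns[j][:]
        for i, q in enumerate(q_columns):
            r[i][j] = sum(qk * vk for qk, vk in zip(q, v))
            v = [vk - r[i][j] * qk for vk, qk in zip(v, q)]
        r[j][j] = sum(vk * vk for vk in v) ** 0.5
        if r[j][j] == 0.0:
            raise ValueError("matrix is singular")
        q_columns.append([vk / r[j][j] for vk in v])
    return transpose(q_columns), r


def qr_iterate(a, tol=1e-10, max_iter=500):
    """Repeat A <- R Q until the entries below the diagonal are below tol."""
    n = len(a)
    for iteration in range(1, max_iter + 1):
        q, r = qr_decompose(a)
        a = mat_mul(r, q)                      # similar to the previous iterate
        below = max(abs(a[i][j]) for i in range(1, n) for j in range(i))
        if below < tol:
            return a, iteration
    raise RuntimeError("QR iteration did not converge")


A0 = [[7.0, 8.0, 6.0, 6.0],
      [1.0, 6.0, -1.0, -2.0],
      [1.0, -2.0, 5.0, -2.0],
      [3.0, 4.0, 3.0, 4.0]]
T, iterations = qr_iterate(A0)
eigenvalues = sorted(T[i][i] for i in range(4))
trace = sum(T[i][i] for i in range(4))
print(eigenvalues, trace, iterations)
```

Without shifts the subdiagonal entries shrink only at rates set by ratios of neighboring eigenvalues, so this version needs many more iterations than a shifted implementation would.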
We illustrate these steps by seeing the output from Program 3. Let $A_0$ be the matrix 

\[ A_0 = \begin{bmatrix} 7 & 8 & 6 & 6 \\ 1 & 6 & -1 & -2 \\ 1 & -2 & 5 & -2 \\ 3 & 4 & 3 & 4 \end{bmatrix}, \]

whose eigenvalues are {10, 1, 4, 7}. After step 1, we have 


\[ A_1 = \begin{bmatrix} 7.00 & -9.65 & -6.15 & -2.24 \\ -3.32 & 4.82 & 4.44 & 1.23 \\ 0.00 & -3.24 & 3.17 & -0.81 \\ 0.00 & 0.00 & 0.40 & 7.01 \end{bmatrix}, \]

a Hessenberg matrix. In step 2 we find that after eight iterations we obtain 


est 9:83 49 3.27; 
0.00 1,00 L836 272 
0.00 0.00 4.00 -1.70) 
0.00 0.00 0.00 — 7.00, 


A3 = 


The intermediate matrices are given after Program 3 in Figure 6.5. It should be noted 
that, although the matrices are different, trace$(A_i) = 22$, $i = 0, 1, \ldots, 9$, since we 
have mentioned previously that the trace of a matrix is equal to the sum of its eigenvalues. 
This property turns out to be a very useful check on the implementation of the method 
on the computer. 

The EISPACK programs and IMSL contain subroutines for finding eigenvalues and 
eigenvectors that are based on the QR method. In particular, these will handle the complex 
case. In addition, the LINPACK library uses QR in the singular-value-decomposition of 
a (possibly rectangular) matrix. This latter is applied to least-squares problems (Chapter 
10). There are other variations on this theme. For symmetric matrices, the Jacobi method 
has been widely used. 


6.7 EIGENVALUES/EIGENVECTORS IN ITERATIVE METHODS 


As an illustration of the importance of eigenvalues in numerical analysis, we conclude 
this chapter by discussing their application to the convergence of iterative methods for 
solving a set of linear equations. 

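In Section 2.10, we considered an example system of equations, $Ax = b$, where 

\[ A = \begin{bmatrix} 8 & 1 & -1 \\ 1 & -7 & 2 \\ 2 & 1 & 9 \end{bmatrix}, \qquad
   b = \begin{bmatrix} 8 \\ -4 \\ 12 \end{bmatrix}. \]

In Section 2.10, we showed that both the Jacobi and Gauss-Seidel methods can be written 
in the form 

\[ x^{(n+1)} = Gx^{(n)} = b' - Bx^{(n)}. \qquad (6.22) \]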
For the Jacobi method applied to the example, 

\[ B = \begin{bmatrix} 0.0 & 0.125 & -0.125 \\ -0.143 & 0.0 & -0.286 \\ 0.222 & 0.111 & 0.0 \end{bmatrix}, \qquad (6.23) \]

and for Gauss-Seidel, 

\[ B = \begin{bmatrix} 0.0 & 0.125 & -0.125 \\ 0.0 & 0.018 & -0.304 \\ 0.0 & -0.030 & 0.062 \end{bmatrix}. \qquad (6.24) \]


We now will show that whether an iterative method converges is determined by the 
eigenvalues of its B matrix, and, further, that the rate of convergence is also related to 
their magnitude. 

The general formulation for iteration is $x^{(n+1)} = b' - Bx^{(n)}$. If an iterative method 
converges, then $x^{(n)} \to \bar{x}$, where the symbol $\bar{x}$ represents the solution. Hence $A\bar{x} = b$. 
Eq. (6.22) becomes, for $x = \bar{x}$, 

\[ \bar{x} = b' - B\bar{x}. \]

Let $e^{(n)}$ be the error of the nth iteration: 

\[ e^{(n)} = x^{(n)} - \bar{x}. \]

When the method converges (and it certainly does if the matrix $A$ is diagonally dominant), 
$e^{(n)} \to 0$, the zero vector, as $n$ increases. Using Eq. (6.22), we have 

\[ e^{(n+1)} = -Be^{(n)} = B^2e^{(n-1)} = \cdots = (-B)^{n+1}e^{(0)}. \]

If it should be that $B^n \to 0$, the zero matrix, obviously $e^{(n)} \to 0$. In linear algebra it is 
shown that any square matrix $B$ can be written as $B = UDU^{-1}$. If the eigenvalues of $B$ 
are distinct, then $D$ is a diagonal matrix with the eigenvalues of $B$ on its diagonal. (If 
some of the eigenvalues of $B$ are repeated, then $D$ may be triangular rather than diagonal. 
Our argument holds in either case.) 

We have 

\[ B = UDU^{-1}, \quad B^2 = UD^2U^{-1}, \quad B^3 = UD^3U^{-1}, \quad \ldots, \quad B^n = UD^nU^{-1}. \]

If all the eigenvalues of $B$ (which are the elements on the diagonal of $D$) have magnitudes 
less than 1, it is clear that $D^n$ will approach the zero matrix. This implies that $B^n \to$ 0-matrix, 
$e^{(n)} \to$ 0-vector, and $x^{(n)} \to \bar{x}$. In other words, if all the eigenvalues of $B$, the 
iteration matrix, are less than 1 in magnitude, the iteration will always converge (even 
if the system is not diagonally dominant). It is also easy to see that the rate of convergence 
is determined by the largest eigenvalue in magnitude of $B$, since this controls how rapidly 
$B^n$ approaches the zero matrix. 

We now look at the eigenvalues of the B matrices in Eqs. (6.23) and (6.24). For the 
Jacobi matrix (6.23), we find that two are complex and one is real: 

\[ \lambda_1 = 0.036 + 0.285i, \qquad \lambda_2 = 0.036 - 0.285i, \qquad \lambda_3 = -0.072. \]

The largest magnitude is $(0.036^2 + 0.285^2)^{1/2} = 0.287$. Since all of its eigenvalues 
are less than 1 in magnitude, Jacobi converges for the example problem. (This confirms 
the demonstration in Section 2.10.) 

For the Gauss-Seidel B matrix (Eq. (6.24)), all the eigenvalues are real: 

\[ \lambda_1 = 0, \qquad \lambda_2 = -0.058, \qquad \lambda_3 = 0.137, \]

of which the largest magnitude obviously is 0.137. Therefore the Gauss-Seidel also 
converges (again confirming the demonstration). It converges faster than Jacobi because 
its eigenvalue of largest magnitude is smaller. The more superficial argument about the 
relative rates of convergence given in Section 2.10 is now put on a sound basis. 
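
The difference in convergence rates can be seen by running both iterations on the example system. The sketch below is an illustration under assumptions: the function names, the zero starting vector, and the stopping tolerance are choices made here, not taken from Section 2.10. The exact solution of the system is (1, 1, 1).

```python
def jacobi_step(a, b, x):
    """One Jacobi sweep: every component uses only the old vector."""
    n = len(a)
    return [(b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)]


def gauss_seidel_step(a, b, x):
    """One Gauss-Seidel sweep: new components are used as soon as available."""
    x = list(x)
    n = len(a)
    for i in range(n):
        x[i] = (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
    return x


def iterate(step, a, b, x0, tol=1e-10, max_iter=200):
    """Apply step repeatedly until successive vectors agree within tol."""
    x = list(x0)
    for count in range(1, max_iter + 1):
        new = step(a, b, x)
        change = max(abs(u - v) for u, v in zip(new, x))
        x = new
        if change <= tol:
            return x, count
    raise RuntimeError("iteration did not converge")


A = [[8.0, 1.0, -1.0],
     [1.0, -7.0, 2.0],
     [2.0, 1.0, 9.0]]
b = [8.0, -4.0, 12.0]
x0 = [0.0, 0.0, 0.0]

x_jacobi, n_jacobi = iterate(jacobi_step, A, b, x0)
x_gs, n_gs = iterate(gauss_seidel_step, A, b, x0)
print(x_jacobi, n_jacobi)
print(x_gs, n_gs)
```

Since the error shrinks roughly by the largest eigenvalue magnitude at each step (about 0.287 for Jacobi and 0.137 for Gauss-Seidel), the Gauss-Seidel run should need noticeably fewer sweeps to reach the same tolerance.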

We used the IMSL subroutine EIGRF to obtain these eigenvalues. EIGRF is useful 
to find the eigenvalues and eigenvectors of a matrix whose components are all real 
numbers. 


6.8 CHAPTER SUMMARY 


Can you do all of the following? You can if you really understand Chapter 6. 


1. Solve boundary-value problems using either the shooting method or through a set of 
equations. 


2. Explain when no more than two trials are needed when using the shooting method. 
You can outline an efficient procedure to use when more than two trials are required. 


3. Apply the Rayleigh–Ritz method to a second-order differential equation with 
homogeneous boundary conditions. 


4. Explain what is meant by the term characteristic-value problem and solve simple 
examples of them. 


5. Find the largest and smallest eigenvalues of a matrix and the associated eigenvectors. 


6. Outline the arguments that relate convergence of an iterative method to the eigenvalues 
of a particular matrix. 


7. Use computer programs that employ the shooting method for a boundary-value 
problem and the power method for eigenvalues. You should be able to adapt the 
former to handle derivative end conditions. 


SELECTED READINGS FOR CHAPTER 6 
Burnett (1987); Stewart (1973); Vichnevetsky (1981); Wilkinson (1965). 
6.9 COMPUTER PROGRAMS 


Program 1 (Fig. 6.3) is an example of how a boundary-value problem can be solved by 
a computer. In it, a second-order boundary-value problem is solved, using SUBROUTINE 
RKSYST to integrate the differential equation. Two preliminary solutions are obtained, 
employing two guesses for the initial slope of the function. From these, an improved 
estimate of the initial slope is computed by interpolation; the process of interpolation 
based on the last two values is iterated until the calculated value for the function at the 
end of the interval closely approximates the desired value. The program is a straightfor- 
ward implementation of the technique of Section 6.1. 

If one wishes to program the difference-equation method, the differential equation 
would be approximated by a system of finite-difference equations and the solution to this 
system would be obtained by the subroutines of Chapter 2. In case the boundary-value 
problem is nonlinear, some iterative method normally would be required. 

A subroutine POWER, which finds the largest eigenvalue of an N x N matrix (together 
with its corresponding eigenvector), is given in Program 2 (Fig. 6.4). To use it, the matrix 
and an initial estimate of the eigenvector are passed as parameters. When no estimate of 
the eigenvector is known, one usually sends a vector that has each element equal to unity, 
although there are some matrices for which this will not produce convergence. In such 
cases, one should call the subroutine with a different starting vector, although it must be 
remembered that the power method fails for some matrices. 

In subroutine POWER, the starting vector is multiplied successively by the matrix 
in each iteration, with the product of each multiplication being normalized. When the 
method converges, the normalized product vectors approach the eigenvector and the 
normalization factors approach the eigenvalue. The subroutine terminates iterations when 
successive values of the normalization factor differ by less than a stated tolerance value. 
Normalization is done by dividing each element of the vector by the largest element, so 
the normalized vectors always have unity as their largest element. 

Program 3 (Fig. 6.5) implements the QR algorithm given in Stewart (1973), namely, 
Algorithms 1.1 and 3.5. This program can find the real eigenvalues of a matrix. The 
input values are A, the real matrix; N, the size of the matrix; TOL, the value for 
determining when a subdiagonal element is close enough to 0; and MAXIT, the maximum 
number of iterations. SUBROUTINE HESSN returns the matrix that is similar to the 
original matrix and that is in Hessenberg form. SUBROUTINE QR is the main one that 
generates a sequence of matrices that converge to a triangular one or reaches MAXIT. 
For an N x N matrix, an additional two rows are used to store values. With some 
modifications one can generate the eigenvectors as well or make use of SUBROUTINE 
POWER and shifting. For output we have (1) the original matrix, (2) the Hessenberg 
matrix similar to the original matrix, (3) the iterates on the Hessenberg matrix as it 
converges through similarity matrices to an upper-triangular matrix whose diagonal ele- 
ments are the eigenvalues of the original matrix, and (4) the eigenvalues listed. One can 
verify that the trace for each matrix equals 22 since similar matrices have the same 
eigenvalues. 
Figure 6.3 Program 1. 

      PROGRAM BNDARY (INPUT, OUTPUT) 
C 
C  THIS PROGRAM SOLVES A NON-LINEAR SECOND ORDER BOUNDARY VALUE 
C  PROBLEM BY THE SHOOTING METHOD.  IT EMPLOYS RKSYST TO SOLVE THE 
C  EQUATIONS WITH TWO ASSUMED VALUES OF THE INITIAL SLOPE. 
C 
C  SUBROUTINE RKSYST IS BASED ON THE SIMPLE FOURTH ORDER RUNGE- 
C  KUTTA METHOD.  IT WOULD BE EASY TO REPLACE IT WITH SUBROUTINE 
C  RKFSYS IN CHAPTER 5. 
C 
C  THE EQUATION IS :  D2X/DT2 + (3X)(DX/DT) + X = T,  X(1)=2, X(3)=7 
C  IT IS REWRITTEN AS TWO FIRST ORDER EQUATIONS : 
C 
C       DX1/DT = X2                       X1(1) = 2 
C       DX2/DT = T - X1 - 3(X1)(X2)       X2(1) = ? 
C 
C  WHERE  X1 = X  AND  X2 = DX/DT 
C 
C  THE PROGRAM INTERPOLATES FROM THE INITIAL RESULTS TO FIND A 
C  BETTER ESTIMATE OF THE UNKNOWN INITIAL VALUE OF DX/DT = X2(1). 
C  ITERATIONS ARE CONTINUED UNTIL SUCCESSIVE VALUES OF X(3) EQUAL 7 
C  WITHIN A SPECIFIED TOLERANCE.  IN THE PROGRAM, VARIABLES X1 & X2 
C  ARE THE FIRST AND SECOND COMPONENTS OF THE VECTOR X0. 
C 
      REAL X0(2),XEND(2),XWRK(4,2),F(2),T0,TSTART,H,G1,G2 
     +     ,TOL,XSTART,D,R1,R2 
      INTEGER I,ITER 
      EXTERNAL DERIVS 
C 
C  INITIALIZE THE VARIABLES WITH A DATA STATEMENT 
C 
      DATA H,N,G1,G2,TOL/0.1,2,-1.0,1.0,.001/ 
      DATA TSTART,XSTART,D/1.0,2.0,7.0/ 
C 
C  DO THE INTEGRATION ONCE. 
C 
      X0(1) = XSTART 
      T0 = TSTART 
      X0(2) = G1 
      DO 10 I = 1,20 
        CALL RKSYST (DERIVS,T0,H,X0,XEND,XWRK,F,N) 
C 
C  STEP UP THE VARIABLES FOR THE NEXT INTERVAL. 
C 
        X0(1) = XEND(1) 
        X0(2) = XEND(2) 
        T0 = T0 + H 
   10 CONTINUE 
C 
C  SAVE THE FIRST RESULT FOR EXTRAPOLATING. 
C 
      R1 = X0(1) 
C 
C  DO THE INTEGRATION AGAIN WITH THE SECOND ESTIMATE OF DX/DT. 
C 
      T0 = TSTART 
      X0(1) = XSTART 
      X0(2) = G2 
      DO 20 I = 1,20 
        CALL RKSYST (DERIVS,T0,H,X0,XEND,XWRK,F,N) 
        X0(1) = XEND(1) 
        X0(2) = XEND(2) 
        T0 = T0 + H 
   20 CONTINUE 
C 
C  SAVE THE SECOND CALCULATED VALUE. 
C 
      R2 = X0(1) 
C 
C  NOW WE EXTRAPOLATE.  THEN REPEAT THE ABOVE CALCULATIONS A MAXIMUM 
C  OF 20 TIMES, OR UNTIL WE MATCH DESIRED VALUE OF X1(3). 
C 
      DO 40 ITER = 1,20 
        PRINT 205, G1,G2,R1,R2 
        IF ( ABS(R2-D) .LE. TOL ) GO TO 99 
        T0 = TSTART 
        X0(1) = XSTART 
        X0(2) = G1 + (G2 - G1)/(R2 - R1)*(D - R1) 
        G1 = G2 
        G2 = X0(2) 
        R1 = R2 
        DO 30 I = 1,20 
          CALL RKSYST (DERIVS,T0,H,X0,XEND,XWRK,F,N) 
          X0(1) = XEND(1) 
          X0(2) = XEND(2) 
          T0 = T0 + H 
   30   CONTINUE 
        R2 = X0(1) 
   40 CONTINUE 
C 
C  WHEN WE HAVE A NORMAL TERMINATION OF THE LOOP, WE DIDN'T CONVERGE. 
C  PRINT A MESSAGE, THEN THE FINAL VALUE. 
C 
      PRINT 200 
      GO TO 100 
C 
C  WE DID CONVERGE IF WE COME HERE.  PRINT MESSAGE AND VALUE. 
C 
   99 PRINT 201, ITER 
  100 PRINT 202, R2,G2 
C 
C  NOW RECOMPUTE AND WRITE OUT THE LAST SET OF COMPUTATIONS. 
C 


c 


aaa 


aaaaaaaaaaa 


TO = TSTART 
X0(1) = XSTART 
x0(2) = G2 


PRINT 203, TO,X0 

DO 50 I = 1,20 
CALL RKSYST (DERIVS,T0,H,X0,XEND, XWRK, F,N) 
X0(1) = XEND(1) 
X0(2) = XEND(2) 
TO = T0 +H 
PRINT 204, TO,X0 

50 CONTINUE 


200 FORMAT(/’ WE DID NOT MEET TOLERANCE CRITERION IN 20 ’, 
+ ‘ITERATIONS. FINAL VALUE AT END OF INTERVAL WAS ‘) 
201 FORMAT(/’ TOLERANCE CRITERION MET IN ‘,13,’ ITERATIONS.’, 
aa ‘ FINAL VALUE AT END OF INTERVAL WAS ’) 
202 FORMAT (/20X,F7.4,’ USING INITIAL SLOPE VALUE OF ’,F7.4) 
203 FORMAT(/’ LAST COMPUTATIONS WERE '//6X,’TIME’,5X, 
+ ‘X1 VALUE’ , 5X, 
+ *X2 VALUE’ //7X,F3.1,6X,F7.4,6X,F7.4) 
204 FORMAT (7X,F3.1,6X,F7.4,6X,F7.4) 
205 FORMAT(/’ GUESSES ARE: ‘,2F8.4,'; FINAL VALUES ARE: ’,2F8.4) 


SUBROUTINE DERIVS : 
THIS SUBROUTINE COMPUTES DERIVATIVE VALUES 
FOR THE R-K ROUTINE. 


REAL X(N),F(N),T j 
INTEGER N | 


F(1) = X(2) 

F(2) = T - X(1) = 3..0*X(1)*X(2) 

RETURN 

END 

SUBROUTINE RKSYST(DERIVS, T0,H,X0,XEND, XWRK, F,N) 


SUBROUTINE RKSYST : 

THIS SUBROUTINE SOLVES A SYSTEM OF N FIRST 
ORDER DIFFERENTIAL EQUATIONS BY THE RUNGE-KUTTA METHOD. THE 
EQUATIONS ARE OF THE FORM DX1/DT = F1(X,T) , DX2/DT = F2(X,T),ETC. 


6.9: COMPUTER PROGRAMS 


Figure 6.3 (continued) 


agAARAAAAAAAAAAAAA 


aaaaa 


aaaaa 
C     DERIVS - A SUBROUTINE THAT COMPUTES VALUES OF THE N DERIVATIVES. 
C              IT MUST BE DECLARED EXTERNAL BY THE CALLER.  IT IS 
C              INVOKED BY THE STATEMENT : 
C                    CALL DERIVS (X,T,F,N) 
C     T0     - THE INITIAL VALUE OF INDEPENDENT VARIABLE 
C     H      - THE INCREMENT TO T, THE STEP SIZE 
C     X0     - THE ARRAY THAT HOLDS THE INITIAL VALUES OF THE FUNCTIONS 
C     XEND   - AN ARRAY THAT RETURNS THE FINAL VALUES OF THE FUNCTIONS 
C     XWRK   - AN ARRAY USED TO HOLD INTERMEDIATE VALUES DURING THE 
C              COMPUTATION.  IT MUST BE DIMENSIONED OF SIZE 4 X N IN 
C              THE MAIN PROGRAM. 
C     N      - THE NUMBER OF EQUATIONS IN THE SYSTEM BEING SOLVED 
C     F      - AN ARRAY THAT HOLDS VALUES OF THE DERIVATIVES 
C 
      REAL X0(N),XEND(N),XWRK(4,N),F(N),H,T0 
      INTEGER I,N 
C 
C  GET FIRST ESTIMATE OF THE DELTA X'S 
C 
      CALL DERIVS (X0,T0,F,N) 
      DO 10 I = 1,N 
        XWRK(1,I) = H * F(I) 
        XEND(I) = X0(I) + XWRK(1,I)/2.0 
   10 CONTINUE 
C 
C  GET THE SECOND ESTIMATE.  THE XEND VECTOR HOLDS THE X VALUES. 
C 
      CALL DERIVS (XEND,T0+H/2.0,F,N) 
      DO 20 I = 1,N 
        XWRK(2,I) = H * F(I) 
        XEND(I) = X0(I) + XWRK(2,I)/2.0 
   20 CONTINUE 
C 
C  REPEAT FOR THIRD ESTIMATE 
C 
      CALL DERIVS (XEND,T0+H/2.0,F,N) 
      DO 30 I = 1,N 
        XWRK(3,I) = H * F(I) 
        XEND(I) = X0(I) + XWRK(3,I) 
   30 CONTINUE 
C 
C  NOW GET LAST ESTIMATE 
C 
      CALL DERIVS (XEND,T0+H,F,N) 
      DO 40 I = 1,N 
        XWRK(4,I) = H * F(I) 
   40 CONTINUE 
C 
C  WE COMPUTE THE X AT THE END OF THE INTERVAL FROM A WEIGHTED AVERAGE 
C  OF THE FOUR ESTIMATES, THEN RETURN. 
C 
      DO 50 I = 1,N 
        XEND(I) = X0(I) + ( XWRK(1,I) + 2.0*XWRK(2,I) + 
     +            2.0*XWRK(3,I) + XWRK(4,I) ) / 6.0 
   50 CONTINUE 
      RETURN 
      END 


DEFINE THE FUNCTIONS OF THE PROBLEM IN TERMS OF THE 
F’S AND THE X’S. 


aaaa 


OUTPUT FOR PROGRAM 1 

 GUESSES ARE:  -1.0000  1.0000; FINAL VALUES ARE:  1.8623  2.1153 
 GUESSES ARE:   1.0000 39.6079; FINAL VALUES ARE:  2.1153  5.1279 
 GUESSES ARE:  39.6079 63.5995; FINAL VALUES ARE:  5.1279  6.3780 
 GUESSES ARE:  63.5995 75.5366; FINAL VALUES ARE:  6.3780  6.9346 
 GUESSES ARE:  75.5366 76.9387; FINAL VALUES ARE:  6.9346  6.9986 
 GUESSES ARE:  76.9387 76.9694; FINAL VALUES ARE:  6.9986  7.0000 

 TOLERANCE CRITERION MET IN   6 ITERATIONS. FINAL VALUE AT END OF INTERVAL WAS 

                     7.0000 USING INITIAL SLOPE VALUE OF 76.9694 

 LAST COMPUTATIONS WERE 

      TIME     X1 VALUE     X2 VALUE 

       1.0      2.0000      76.9694 
       1.1      6.6453      14.9031 
       1.2      7.1277       5.8486 
       1.3      7.2761       2.3699 
       1.4      7.3204        .8684 
       1.5      7.3237        .2186 
       1.6      7.3101       -.0579 
       1.7      7.2895       -.1726 
       1.8      7.2663       -.2178 
       1.9      7.2422       -.2338 
       2.0      7.2180       -.2374 
       2.1      7.1941       -.2360 
       2.2      7.1706       -.2324 
       2.3      PALA:        -.2280 
       2.4      7.1249       -.2232 
       2.5      7.1028       -.2183 
       2.6      7.0813       -.2133 
       2.7      7.0602       -.2082 
       2.8      7.0396       -.2032 
       2.9      7.0195       -.1980 
       3.0      7.0000       -.1929 
Figure 6.4 Program 2. 

      SUBROUTINE POWER (A,X,C,TOL,NLIMIT,N,NDIM,XWRK) 
C 
C  SUBROUTINE POWER : 
C                     THIS SUBROUTINE COMPUTES THE LARGEST EIGEN- 
C  VALUE AND ITS CORRESPONDING EIGENVECTOR BY THE POWER METHOD. 
C 
C  PARAMETERS ARE : 
C 
C     A      - AN N X N MATRIX WHOSE EIGENS ARE BEING DETERMINED 
C     X      - ESTIMATE FOR THE EIGENVECTOR USED TO BEGIN ITERATIONS. 
C              IF NO APPROXIMATION TO THIS IS KNOWN, THE USUAL CHOICE 
C              IS A VECTOR WITH ALL COMPONENTS EQUAL TO UNITY.  X ALSO 
C              RETURNS THE FINAL EIGENVECTOR TO THE CALLER. 
C     C      - RETURNS THE VALUE OF THE EIGENVALUE 
C     TOL    - TOLERANCE VALUE USED TO DETERMINE CONVERGENCE. ITERATIONS 
C              CONTINUE UNTIL SUCCESSIVE ESTIMATES OF THE EIGENVALUE 
C              ARE THE SAME WITHIN TOL IN VALUE. 
C     NLIMIT - LIMIT TO THE NUMBER OF ITERATIONS IF NOT CONVERGENT 
C     N      - SIZE OF THE MATRIX AND THE VECTOR X 
C     NDIM   - FIRST DIMENSION OF MATRIX A IN THE CALLING PROGRAM 
C     XWRK   - VECTOR USED TO STORE INTERMEDIATE VALUES.  MUST BE 
C              DIMENSIONED TO HOLD AT LEAST N ELEMENTS IN THE 
C              MAIN PROGRAM. 
C 
      REAL A(NDIM,N),X(N),XWRK(N),C,TOL,SAVE 
      INTEGER N,NDIM,NLIMIT,IROW,ITER,JCOL 
C 
C  BEGIN THE ITERATIONS.  GET THE PRODUCT OF A AND X, THEN NORMALIZE. 
C  THE NORMALIZATION FACTORS SHOULD CONVERGE TO THE EIGENVALUE, C, ON 
C  REPEATED MULTIPLICATIONS.  WE STORE THE CURRENT VALUE OF C FOR 
C  COMPARISON WITH THE NEXT VALUE TO TEST CONVERGENCE.  TO BEGIN, WE 
C  MAKE SAVE = 0. 
C 
      SAVE = 0.0 
      DO 50 ITER = 1,NLIMIT 
        DO 20 IROW = 1,N 
          XWRK(IROW) = 0.0 
          DO 10 JCOL = 1,N 
            XWRK(IROW) = XWRK(IROW) + A(IROW,JCOL)*X(JCOL) 
   10     CONTINUE 
   20   CONTINUE 
C 
C  FIND THE LARGEST ELEMENT OF THE PRODUCT VECTOR FOR NORMALIZING. 
C 
        C = 0.0 
        DO 30 IROW = 1,N 
          IF ( ABS(C) .LT. ABS(XWRK(IROW)) ) C = XWRK(IROW) 
   30   CONTINUE 
C 
C  NORMALIZE THE PRODUCT VECTOR AND PUT INTO X FOR NEXT ITERATION. 
C 
        DO 40 IROW = 1,N 
          X(IROW) = XWRK(IROW) / C 
   40   CONTINUE 
C 
C  SEE IF TOLERANCE IS MET.  IF SO, WE ARE DONE.  IF NOT, WE CONTINUE 
C  ITERATIONS. 
C 
        IF ( ABS(C - SAVE) .LE. TOL ) RETURN 
        SAVE = C 
   50 CONTINUE 
C 
C  IF WE DO NOT MEET THE TOLERANCE FOR CONVERGENCE, PRINT A MESSAGE 
C  AND RETURN THE LAST VALUES CALCULATED. 
C  (THE PRINTED LISTING BREAKS OFF AT THIS POINT; THE CLOSING 
C  STATEMENTS BELOW ARE AN ASSUMED COMPLETION.) 
C 
      PRINT 200, NLIMIT 
  200 FORMAT(/' TOLERANCE NOT MET AFTER ',I4,' ITERATIONS.') 
      RETURN 
      END 
Figure 6.5 Program 3. 

      PROGRAM PQR (INPUT, OUTPUT) 
C 
C  THIS PROGRAM IMPLEMENTS THE QR ALGORITHM. 
C  INPUT:  A,N,TOL 
C       A   - THE MATRIX WHOSE EIGENVALUES WE WANT. 
C       N   - THE SIZE OF THE MATRIX, A 
C       TOL - THE VALUE ON THE SUBDIAGONAL ELEMENT WE SHALL TAKE 
C             AS 0 
C 
C  TWO SUBROUTINES ARE CALLED: 
C       SUBROUTINE HESSN - PRODUCES THE HESSENBERG MATRIX 
C       SUBROUTINE QR    - THE ACTUAL QR METHOD IS APPLIED TO THE 
C                          HESSENBERG MATRIX 
C 
C  NOTE: THIS PROGRAM AS GIVEN HERE JUST PRODUCES THE REAL 
C        EIGENVALUES OF THE INPUT MATRIX, A. 
C 
C  ALGORITHMS 1.1 AND 3.5 OF STEWART ARE IMPLEMENTED IN THE 
C  SUBROUTINES. 
C 
      REAL A(12,10), TOL 
      INTEGER N,I,J 
C 
      DATA N,NDIM,TOL/4,12,1.0E-6/ 
      DATA MAXIT/50/ 
      DATA A/7,1,1,3,8*0,8,6,-2,4,8*0, 6,-1,5,3,8*0, 
     +       6,-2,-2,4,80*0/ 
C 
C  ECHO ORIGINAL MATRIX 
C 
      PRINT 100 
      CALL PRINTA(A,NDIM,N) 
      PRINT 200 
      CALL HESSN(A,NDIM,N) 
      PRINT 300 
      CALL PRINTA(A,NDIM,N) 
      PRINT 200 
      CALL QR(A,NDIM,N,MAXIT,TOL) 
      PRINT 400 
      CALL PRINTA(A,NDIM,N) 
C 
C  PRINT OUT THE EIGENVALUES 
C 
      PRINT 500 
      PRINT '(8F10.4)', (A(I,I), I = 1,N) 
C 
  100 FORMAT (///T10,'THE ORIGINAL MATRIX IS: ') 
  200 FORMAT (///) 
  300 FORMAT (///T10,'THE HESSENBERG MATRIX IS: ') 
  400 FORMAT (///T10,'THE FINAL MATRIX IS: ') 
  500 FORMAT (////T10,'THE EIGENVALUES ARE: ') 
      STOP 
      END 
C 
      SUBROUTINE PRINTA(A,NDIM,N) 
      REAL A(NDIM,N) 
      INTEGER NDIM,N,I,J 
C 
      DO 10 I = 1,N 
        PRINT 100, (A(I,J), J = 1,N) 
   10 CONTINUE 
  100 FORMAT (8F10.4) 
      RETURN 
      END 

      SUBROUTINE HESSN(A,NDIM,N) 
      REAL A(NDIM,N), V(10), SIGMA, ETA, RHO, SUM, PI 
      INTEGER N, NLESS1, NLESS2, K, KPLUS1, NPLUS1, NPLUS2, I, J, NDIM 
C 
C  SUBROUTINE HESSN 
C       INPUT  : A  THE ORIGINAL MATRIX A 
C       OUTPUT : THE MATRIX SIMILAR TO A BUT IN HESSENBERG FORM. 
C 
C  SIMPLE REFLECTORS ARE GENERATED, V(1), V(2), ... , V(N-2) SO 
C  THAT A IS TRANSFORMED TO: 
C 
C       V(N-2)*...*V(2)*V(1)*A*V(1)*V(2)*...*V(N-2) 
C 
      NLESS1 = N-1 
      NPLUS1 = N+1 
      NPLUS2 = N+2 
      NLESS2 = N-2 
      DO 10 K = 1,NLESS2 
        KPLUS1 = K+1 
        ETA = ABS(A(KPLUS1,K)) 
        DO 20 I = KPLUS1,N 
          IF (ETA .LT. ABS(A(I,K))) ETA = ABS(A(I,K)) 
   20   CONTINUE 
        IF (ETA .NE. 0.0) THEN 
          SUM = 0.0 
          DO 30 I = KPLUS1,N 
            V(I) = A(I,K)/ETA 
            A(I,K) = V(I) 
            SUM = SUM + V(I)**2 
   30     CONTINUE 
          IF (V(KPLUS1) .NE. 0.0) THEN 
            SIGMA = SQRT(SUM)*V(KPLUS1)/ABS(V(KPLUS1)) 
          ENDIF 
          V(KPLUS1) = V(KPLUS1) + SIGMA 
          PI = SIGMA*V(KPLUS1) 
          A(NPLUS1,K) = PI 
C 
C  APPLY THE REFLECTOR FROM THE LEFT (ROWS K+1 TO N). 
C 
          DO 40 J = K+1,N 
            SUM = 0.0 
            DO 50 I = KPLUS1,N 
              SUM = SUM + V(I)*A(I,J) 
   50       CONTINUE 
            RHO = SUM/PI 
            DO 60 I = KPLUS1,N 
              A(I,J) = A(I,J) - RHO*V(I) 
   60       CONTINUE 
   40     CONTINUE 
C 
C  APPLY THE REFLECTOR FROM THE RIGHT (COLUMNS K+1 TO N). 
C 
          DO 80 I = 1,N 
            SUM = 0.0 
            DO 90 J = K+1,N 
              SUM = SUM + A(I,J)*V(J) 
   90       CONTINUE 
            RHO = SUM/PI 
            DO 100 J = KPLUS1,N 
              A(I,J) = A(I,J) - RHO*V(J) 
  100       CONTINUE 
   80     CONTINUE 
C 
          A(NPLUS2,K) = V(KPLUS1) 
          A(KPLUS1,K) = -ETA*SIGMA 
        ELSE 
          A(NPLUS1,K) = 0.0 
        ENDIF 
   10 CONTINUE 
C 
C  ZERO OUT ENTRIES BELOW SUB-DIAGONAL 
C 
      DO 110 J = 1,NLESS2 
        DO 120 I = J+2,N 
          A(I,J) = 0.0 
  120   CONTINUE 
  110 CONTINUE 
      RETURN 
      END 


      SUBROUTINE QR(A,NDIM,N,MAXIT,TOL)
      REAL A(NDIM,N),GAMMA(10),ETA,ALFA,BETA,DELTA,
     +     KAPPA,SIGMA(10)
C
      INTEGER NDIM,N,NSTORE,MAXIT,KPLUS1,KLESS1,I,J,K
C
      NSTORE = N
      DO 10 I = 1,MAXIT
C
C     ----------------------------------------------
C
C     WE CHOOSE OUR CURRENT SHIFT VALUE SO THAT WE
C     ARE WORKING ON:  A - K*I
C
C     ----------------------------------------------
C
         KAPPA = A(N,N)
         A(1,1) = A(1,1) - KAPPA
C
         DO 20 K = 1,N
            KPLUS1 = K+1
            IF (K.NE.N) THEN
C
C     DETERMINE ROTATION PARAMETERS: GAMMA, SIGMA, XNU
C
               ETA = AMAX1( ABS(A(K,K)),ABS(A(KPLUS1,K)) )
               ALFA = A(K,K)/ETA
               BETA = A(KPLUS1,K)/ETA
               DELTA = SQRT(ALFA*ALFA + BETA*BETA)
               GAMMA(K) = ALFA/DELTA
               SIGMA(K) = BETA/DELTA
               XNU = ETA*DELTA
C
               A(K,K) = XNU
               A(KPLUS1,K) = 0.0
               A(KPLUS1,KPLUS1) = A(KPLUS1,KPLUS1) - KAPPA
C
               DO 30 J = KPLUS1,N
                  AKJ = A(K,J)
                  A(K,J) = GAMMA(K)*A(K,J) + SIGMA(K)*A(KPLUS1,J)
                  A(KPLUS1,J) = GAMMA(K)*A(KPLUS1,J) - SIGMA(K)*AKJ
30             CONTINUE
            ENDIF
C
            IF (K .NE. 1) THEN
               DO 40 J = 1,K
                  KLESS1 = K-1
                  AIKL1 = A(J,KLESS1)
                  A(J,KLESS1) = GAMMA(KLESS1)*A(J,KLESS1)
     +                          + SIGMA(KLESS1)*A(J,K)
                  A(J,K) = GAMMA(KLESS1)*A(J,K) - SIGMA(KLESS1)*AIKL1
40             CONTINUE
               A(KLESS1,KLESS1) = A(KLESS1,KLESS1) + KAPPA
            ENDIF
20       CONTINUE
C
         A(N,N) = A(N,N) + KAPPA
C
C     THE MATRIX SIZE IS REDUCED BY 1 AFTER EACH ITERATION IN
C     ORDER TO BE MORE EFFICIENT.
C
C     THE EIGENVALUES OF THE MATRIX APPEAR ON THE DIAGONAL ENTRIES OF
C     A STARTING AT N.
C
         IF (ABS(A(N,N-1)) .LT. TOL) N = N-1
         IF (N .EQ. 1) THEN
            N = NSTORE
            RETURN
         END IF
C
         PRINT 100, I
         CALL PRINTA(A,NDIM,NSTORE)
100      FORMAT (///T10,'AT ITERATION ',I2,4X,'THE MATRIX IS: '//)
C
10    CONTINUE
      RETURN
      END


OUTPUT FOR PROGRAM 3


THE ORIGINAL MATRIX IS:

7.0000 8.0000 6.0000 6.0000
1.0000 6.0000 -1.0000 -2.0000
1.0000 -2.0000 5.0000 -2.0000
3.0000 4.0000 3.0000 4.0000


THE HESSENBERG MATRIX IS:

7.0000 -9.6484 -6.1545 -2.2431
-3.3166 4.8182 4.4403 1.2343
.0000 -3.2423 3.1692 -.8097
.0000 .0000 .3963 7.0126


AT ITERATION 1 THE MATRIX IS:

4.7690 -1.7275 4.8182 -2.3993
-10.1705 4.7939 -5.9875 3.8708
.0000 -.5489 5.4359 .4239
.0000 .0000 -.0027 7.0012


AT ITERATION 2 THE MATRIX IS:

2.3015 -10.7577 -.4193 -3.2668
-1.3015 9.2441 4.2663 2.7199
.0000 -1.6038 3.4545 -1.6958
.0000 .0000 .0000 7.0000


AT ITERATION 3 THE MATRIX IS:

.2056 .9400 2.8150 -3.2668
-8.9815 10.9236 -4.1503 2.7199
.0000 .0561 3.8708 -1.6958
.0000 .0000 .0000 7.0000


AT ITERATION 4 THE MATRIX IS:
6.5804 -11.4904 pets oe as -3.2668 
-1.6623 4.4183 4.1916 2.7199 


.0000 -.0041 4.0013 -1.6958
.0000 .0000 .0000 7.0000


AT ITERATION 5 THE MATRIX IS:

11.9350 6.6492 4.9054 -3.2668
-3.1823 -.9350 1.8256 2.7199
.0000 .0000 4.0000 -1.6958
.0000 .0000 .0000 7.0000


AT ITERATION 6 THE MATRIX IS:

10.3857 9.4484 4.9054 -3.2668
-.3831 .6143 1.8256 2.7199
.0000 .0000 4.0000 -1.6958
.0000 .0000 .0000 7.0000


AT ITERATION 7 THE MATRIX IS:

10.0158 9.8170 4.9054 -3.2668
-.0145 .9842 1.8256 2.7199
.0000 .0000 4.0000 -1.6958
.0000 .0000 .0000 7.0000


AT ITERATION 8 THE MATRIX IS:

10.0000 9.8315 4.9054 -3.2668
.0000 1.0000 1.8256 2.7199
.0000 .0000 4.0000 -1.6958
.0000 .0000 .0000 7.0000


THE FINAL MATRIX IS:

10.0000 9.8315 4.9054 -3.2668
.0000 1.0000 1.8256 2.7199
.0000 .0000 4.0000 -1.6958
.0000 .0000 .0000 7.0000


THE EIGENVALUES ARE:

10.0000 1.0000 4.0000 7.0000
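
The printed eigenvalues can be cross-checked independently of the QR iteration. The following is only a sketch, in Python rather than the FORTRAN of the listing, and the helper names `det` and `char_poly` are illustrative assumptions, not part of Program 3. It evaluates det(A - lambda*I) for the 4 x 4 test matrix given in the DATA statement; the value should vanish (to roundoff) at each eigenvalue, and the trace of A should equal the sum of the eigenvalues.

```python
# Cross-check sketch: the diagonal of the final matrix should contain
# the eigenvalues of the original test matrix A.

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]          # work on a copy
    n = len(a)
    d = 1.0
    for k in range(n):
        # choose the largest pivot in column k
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[p][k] == 0.0:
            return 0.0
        if p != k:
            a[k], a[p] = a[p], a[k]
            d = -d                     # a row swap changes the sign
        d *= a[k][k]
        for r in range(k + 1, n):
            f = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= f * a[k][c]
    return d

# The matrix of the DATA statement, written here by rows.
A = [[7.0, 8.0, 6.0, 6.0],
     [1.0, 6.0, -1.0, -2.0],
     [1.0, -2.0, 5.0, -2.0],
     [3.0, 4.0, 3.0, 4.0]]

def char_poly(lam):
    """Value of det(A - lam*I)."""
    n = len(A)
    shifted = [[A[i][j] - (lam if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    return det(shifted)

# Residuals of the characteristic polynomial at the printed eigenvalues.
residuals = {lam: char_poly(lam) for lam in (10.0, 1.0, 4.0, 7.0)}
trace = sum(A[i][i] for i in range(len(A)))

for lam, res in residuals.items():
    print(lam, res)
print("trace =", trace)
```

A residual that is not small at one of the four values would indicate a transcription error in the matrix or in the output.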
EXERCISES


Section 6.1 


>1. Solve the boundary-value problem 

$$y'' + xy' - 3y = 4.2x, \qquad y(0) = 0, \quad y(1) = 1.9,$$

by the shooting method, assuming two values of the initial slope, which is near unity. Use 
h = 0.25 and use the modified Euler method, carrying three decimals. Compare to the 
analytical solution $y = x^3 + 0.9x$. 


2. Show that the solution to Exercise 1 found by integrating with an initial slope interpolated 
from the results of the first two trials can be obtained by interpolating from the trial solutions 
themselves. 

>3. In Exercise 1, the exact (analytical) value of the initial slope is 0.9. Using h = 0.25, you 
probably find that y(1) = 1.9 is achieved with a different value of y'(0). Explain this 
discrepancy. 

4. The shooting method can work backward through the interval as well as forward. Solve 
Exercise 1 by moving backward from x = 1 ($\Delta x$ is then negative), and use an assumed value 
for the slope at x = 1. 

> 5. Use the shooting method to solve 
yay se, 
y(0) = 1, y(1) = -1. 
Use the computer programs of Chapter 5, or another one that you have written, to solve the 
problem with assumed values of the initial slope. Use h = 0.2. 

6. Exercise 5 solves a nonlinear boundary-value problem. The solution will not be exact because 
the solution of a differential equation by a numerical method has errors that depend on the 
step size. If the step size is reduced, the errors will decrease. Vary the step size in Exercise 
5 until you believe that the value of y at x = 0.5 is accurate to within $5 \times 10^{-5}$. Justify your 
belief that you have attained this accuracy. 

7. Use the shooting method to solve 

Section 6.2 
8. Given the boundary problem 


$$\frac{d^2y}{d\theta^2} + y = 0, \qquad y(0) = 0, \quad y\left(\frac{\pi}{2}\right) = 1.$$
a) Normalize to the interval [0, 1] by an appropriate change of variable. 
b) Solve the normalized problem by the finite-difference method, replacing the derivative by 
a difference quotient of error $O(h^2)$, and then solving the set of equations. Use h = 0.25. 
>c) Solve the original equation (without normalizing) by finite difference. Take $\Delta\theta$ = 
d) Compare your solutions to the analytical solution $y = \sin\theta$. 
e) Repeat (b) with h = 0.5, and extrapolate the value at the midpoint. 
9. Solve the boundary-value problem 
wavy” + xy’ — xy = 2x, y(0) = 1, yl) = 0. 
Take h = 0.2. 
10. Solve Exercise 1 by replacing the derivatives with finite-difference quotients. 
11. Solve Exercise 5 by rewriting the differential equation as a set of difference equations. 
12. Solve Exercise 7 through a set of difference equations. 


Section 6.3 
>13. Repeat Exercise 8, except for the boundary-value problem with derivative conditions: 
dy awe i 
qe ty=% y'(0) = 0, y( 
In part (d), compare to $y = -\cos\theta$. 
14. Solve 


d’y 
de 


Subdivide the interval into four equal parts. 


+y=0, yO) + yO) = 1, v (3) +f v( 


15. The most general form of conditions that can be specified for a second-order boundary-value 
problem is a linear combination of the function and its derivative at both ends of the interval. 
Set up the equations to solve 


y(0) + y'(0) - y(1) - y'(1) = 1, 
y(0) + y'(0) + y(1) + y'(1) = 


ere es { 


Use h = 4. 
16. Solve the third-order boundary-value problem 


y"-y' =e, yO) = 0, y(1) = 1, y’(1) = 0. 


Use h = 0.2. Approximate the third derivative in terms of the average third central difference 


$$y''' = \frac{y_2 - 2y_1 + 2y_{-1} - y_{-2}}{2h^3} + O(h^2)$$

at x = 0.4, 0.6, 0.8. Using the derivative condition will eliminate the assumed function value 
at x = 1.2, but we are short one equation. Obtain this by writing the equation at x = 0.2 
using the unsymmetrical approximation for y''': 

$$y''' = \frac{-y_3 + 6y_2 - 12y_1 + 10y_0 - 3y_{-1}}{2h^3} + O(h^2).$$
>17, Solve the fourth-order problem 
PP ty=%, wO=0, ¥O=0 Wh=2 yy =o. 
Use symmetrical expressions for the derivatives. 
18. Solve the differential equation in Exercise 17 subject to the conditions 
yO) = y'(0) = y"(0) = 0, yy) = 2. 


Unsymmetrical expressions may not be required! 
19. Solve the nonlinear problem 


fad, y'(1) = 0.9093, (2) = 0.8268, 


sin? x” 


by a set of difference equations. Compare to the analytical solution, $y = \sin^2 x$. 


Section 6.4 

20. Show that the integrand of Eq. (6.12) becomes Eq. (6.11) by use of the Euler-Lagrange 
condition. This means that Eq. (6.12) is the functional for any second-order boundary value 
problem of the form 

$$y'' + Q(x)y = F(x),$$
when the boundary conditions are homogeneous: y(a) = 0, y(b) = 0. 
>21. Use the Rayleigh-Ritz method to approximate the solution of 
$$y'' = 2x, \qquad y(0) = 0, \quad y(1) = 0,$$

using a quadratic in x as the approximating function. Compare to the analytical solution by 
graphing both the approximate and analytical solutions. 

22. a) Repeat Exercise 21 except use, for the approximating function, 

$$ax(x - 1) + bx^2(x - 1).$$

Show that this reproduces the analytical solution. 


>b) Suppose the boundary conditions for part (a) are y(0) = 1, y(1) = 3. How can you modify 
part (a) to give the solution? 


c) Another form of a cubic approximation that meets the boundary conditions is 
$$ax(x - 1) + bx(x - 1)^2.$$
Is the analytical solution reproduced when this is used? 


23. Show that the use of a cubic approximating function in the first example of Section 6.4 
reproduces the analytical solution. 


24. Solve Exercise 21 by the piecewise application of Rayleigh—Ritz using first two, then four 
intervals. Compare these solutions to the analytical answer by graphing them. 


25. Solve by the piecewise Rayleigh—Ritz method, using five intervals: 


y 


¥#S =1+x, wI=2, y(4) = -2. 
Compare to solutions by the shooting method (use Runge-Kutta-Fehlberg) and by 
finite-difference approximations of the derivative. 


Section 6.5 

26. Consider the characteristic-value problem with k restricted to real values: 

$$\frac{d^2y}{dx^2} - k^2 y = 0, \qquad y(0) = 0, \quad y(1) = 0.$$

a) Show analytically that there is no solution except the trivial solution y = 0. 

b) Show, by setting up the set of difference equations corresponding to the differential equation 
with h = 0.2, that there are no real values of k for which a solution to the difference 
equations exists. 
>27. Consider the characteristic-value problem 
$$y'' + 2y' + k^2 y = 0, \qquad y(0) = y(1) = 0.$$
Find the principal eigenvalue and compare to $k = \pm\sqrt{1 + \pi^2} = \pm 3.297$. 
a) Use h = 4. 
b) Use h = 
c) Use h = 4. 
d) Assuming errors are proportional to h, extrapolate from parts (a) and (c), then from parts 
(b) and (c), to get improved values for the principal eigenvalue. 
28. Find the principal eigenvalue of 
$$y'' + k^2 x y = 0, \qquad y(0) = y(1) = 0.$$


29. Using the principal value, k = 3.297, in Exercise 27, find y as a function of x over the 
interval [0, 1]. This function is the corresponding eigenfunction. 

>30. The second eigenvalue in Exercise 27 is $k = \sqrt{1 + 4\pi^2} = 6.3623$. Find the corresponding 
eigenfunction. 

Section 6.6 

31. Find the dominant eigenvalue and the corresponding eigenvector by the power method: 

a) [2 0 ad 3 | o fi 1 
1 5 4 od 
gd) {4 1 0 ee) fl 1 2 
eo 2 1 O13 
O' >0) =I ya 
. (In part (c) the two eigenvalues are equal but of opposite sign.) 
32. \Let 
\ -5 2 0 -6+2i -1 -2i 
Am) 2 =o iy B= -3 T+#t 0) }: 
a= 7. i =o, as, 
a) Locate the eigenvalues of these matrices using Gerschgorin’s theorems. 
b) From part (a) determine whether either of the matrices is nonsingular. 

33. Use iteration to find all the eigenvalues of the matrices for Exercise 27(a), (b), (c). 

34. Invert the matrices in Exercise 31, then use the power method to get the smallest eigenvalues 
of the original matrices. Repeat, except avoid the inversion by using the LU decomposition 
of the matrix. Compare the effort in the two cases. 

>35. Find all the eigenvalues of matrix A in Exercise 32. Then invert the matrix and again find 
the eigenvalues. Show that the two sets of eigenvalues are reciprocals of each other. 
36. After finding the dominant eigenvalue in Exercise 35, subtract that value from each of the 
diagonal elements and use the power method. Compare the value obtained with the second 
largest eigenvalue as determined in Exercise 35. 
37. Using your results from Exercises 31 and 34, find the general solutions to the following 


systems of differential equations. 
a) x= 2x, b) x= x+2y, 
y= xt Illy. y = Sx + 4y. 
Section 6.7 
38. Let 
_fao 
sl ie | 
Show that 


a" 0 
n= 
s ete rh 


From this infer that 7" — O-matrix if |A] < 1. 


39. Let 

A, 0 0 

T=]a A 0 and 
boc Ay 

- A 0 0 

T=|a A 0 where 
bead 

A= Max { Aj, [Ao|, [As] } 


Write T = L + D, where 000 
L={|a 0 0}, D= 


a) Show that L’ = O-matrix. 
b) Since LD = DL, show that 


T" =(L+ Dy" 
(s)u2pe-2 + (T)Epe! + D», 


using the binomial formula. From this we see that 7” — O-matrix and also 7”, 
c) Prove this for any general triangular matrix 7, where |7,,) < | fori = 1,2... .,n. 


40. Suppose, for Ax = b, that 
432 
A=]2 3 4 
24a 


What is the smallest value of a for which iteration will converge 
a) using Jacobi? 
b) using Gauss-Seidel? 
41. If a cantilever beam of length L, which bends due to a uniform load of w lb/ft, is also subject 
to an axial force P at its free end (see Fig. 6.6), the equation of its elastic curve is 

Figure 6.6


For this equation, the origin O has been taken at the free end. At the point x = L, dy/dx = 
0; at (0, 0), y = 0. Solve this boundary-value problem by the shooting method for a 2" × 
4" × 10-ft wooden beam for which $E = 12 \times 10^5$ lb/in². Find y versus x when the beam 
has the 4" dimension vertical with w = 25 lb/ft and a tension force of P = 500 lb. Also 
solve for the deflections if the beam is turned so that the 4" dimension is horizontal. 

42. Solve Problem 41 by replacing the differential equation by difference equations. Compare 
the solutions by each method. 

43. In Problem 41, y” is used although the radius of curvature should be employed in the equation; 

y" _ we? 
owe ae 
If the deflection of the beam is small, the difference is negligible, but in the second part of 
Problem 41 at least, this is not true. Furthermore, if there is considerable bending of the 
beam, the horizontal distance from the origin to the wall is less than L, the original length 
of the beam. Solve Problem 41 taking these factors into account, and determine by how much 
the deflections differ from those previously calculated. 

44. A cylindrical pipe has a hot fluid flowing through it. Since the pressure is very high, the 
walls of the pipe are thick. For such a situation, the differential equation that relates tem- 
peratures in the metal wall to radial distance is 

$$r\frac{d^2u}{dr^2} + \frac{du}{dr} = 0,$$
where 


r = radial distance from the centerline, 
u = temperature. 


Solve for the temperatures within a pipe whose inner radius is 1 cm and whose outer radius 
is 2 cm if the fluid is at 540°C and the temperature of the outer circumference is 20°C. 

45. The pipe in Problem 44 is insulated to reduce the heat loss. The insulation used has properties 
such that the gradient du/dr at the outer circumference is proportional to the difference in 
temperatures from the outer wall to the surroundings: 

$$\left.\frac{du}{dr}\right|_{r=2} = 0.083[u(2) - 20].$$


Solve Problem 44 with this boundary condition. 

46. When air flows at a velocity of $u_\infty$ past a flat plate placed horizontally to the flow, experiments 
show that the air velocity at the surface of the plate is zero and that a boundary layer within 
which the velocities are less than $u_\infty$ builds up as flow progresses along the plate (Fig. 6.7). 


Figure 6.7 


47. 


48 
This is a typical problem of interest to aeronautical engineers. To solve for the velocities 
within the boundary layer, a stream function f is defined for which 


ay fa 
a2 


where z is a dimensionless distance involving x, y, $u_\infty$, and the viscosity of air. This is a 
boundary-value problem because f = 0 and f' = 0 at z = 0, while f' = 1 at $z = \infty$. In 
solving this problem, one does not use $z = \infty$ as a boundary; rather, the integration is taken 
to a value of z large enough so that there is an insignificant change when it is increased. 
Solve this problem to find f and df/dz as functions of z. 


47. A simple spring-mass system obeys the equation 

$$\frac{d^2y}{dt^2} + a^2 y = 0,$$

where the positive constant $a^2$ equals k/m, the ratio of the spring constant to the mass. If it 
is known that y = 0 at t = 0 and again at t = 1.26 sec (so that its period is 2.52 sec), this 
is a characteristic-value problem that has a solution for only certain values of a. These 
characteristic values are discussed in Section 6.5 for this equation. Suppose, however, that 
the spring is not an ideal one with constant k, but that the spring force and elongation vary 
with y according to the equation k = b(1 — y®"). If it is still true that y = 0 at r = 0 and 
also at ¢ = 1.26, is this a characteristic-value problem requiring that the ratio k/m be at 
certain fixed values in order to have a nontrivial solution? If it is, find these values. 


48. A shaft rotates in fixed bearings (Fig. 6.8) so that at these points y = 0 and dy/dx = 0. The 
governing equation is 


where E is Young's modulus, I is the moment of inertia of the cross-sectional area of the 
shaft, y is the displacement from the horizontal, x is the distance measured from the point 
midway between the bearings, w is the weight of the shaft per unit length, g is the acceleration 
of gravity, 2L is the length of the shaft between the bearings, and $\omega$ is the rotational speed. 




At certain critical speeds, the shaft will distort from a straight line. These critical speeds are 
very important to know in designing the machine of which the shaft is a part. For parameter 
values given here in consistent units, find the first two values of critical rotational speeds. 
Also find the curve y(x) at these speeds. 

E=2- 10° lb/in? —_g = 386.4 in/sec? 

I = 0.46 in^4    L = 16 in 

w = 0.025 Ib/in 
Numerical Solution of Partial-Differential Equations 


7.0 CONTENTS OF THIS CHAPTER 


Chapter 7 introduces you to the important topic of partial-differential equations 
in which there are two or more independent variables. The class of equations 
we study here finds application in steady-state heat flow, fluid flow, electrical 
potential distributions, and so on. We apply two numerical methods to these 
problems: finite differences and finite elements. 


7.1 EQUILIBRIUM TEMPERATURES IN A HEATED SLAB 
Presents a typical two-dimensional problem that illustrates one practical application 
of an elliptic partial-differential equation. 
7.2 EQUATION FOR STEADY-STATE HEAT FLOW 
Derives the equation that governs this situation, introduces you to some terminology, 
and shows how partial-differential equations are classified. 
7.3 REPRESENTATION AS A DIFFERENCE EQUATION 
Utilizes finite-difference approximations (Chapter 4) to change from a differential 
equation to a difference equation. 
7.4 LAPLACE'S EQUATION ON A RECTANGULAR REGION 
Illustrates how the finite-difference method leads to a system of algebraic equations 
that can be solved to give approximate values for the potential (temperature, 
concentration, voltage, or other potential quantity) at points within a rectangular 
region. The question of accuracy is explored. 
7.5 ITERATIVE METHODS FOR LAPLACE'S EQUATION 
Applies iterative techniques (Chapter 3) to solve the system of equations that is 
often too large for elimination methods. You are introduced to the S.O.R. method, 
a means to accelerate the convergence of the iterations. 
7.6 THE POISSON EQUATION 
Is another form of elliptic equation that applies to problems such as torsion, 
membrane displacement, and so on. These can also be handled with the 
finite-difference technique. 
7.7 EIGENVALUES AND S.O.R. 
Applies matrices and their properties to the analysis of acceleration via S.O.R. 
7.8 DERIVATIVE BOUNDARY CONDITIONS 
Extends the finite-difference method to handle a larger class of boundary conditions. 
7.9 IRREGULAR REGIONS AND NONRECTANGULAR GRIDS 
Tackles real-world situations that do not fit neatly into the simpler approach taken 
in previous sections of the chapter. 
7.10 THE LAPLACIAN OPERATOR IN THREE DIMENSIONS 
Shows that, in principle, the finite-difference method extends readily to 
three-dimensional problems but points out that the size of the system to be solved may 
easily exceed the available computing power. 
7.11 MATRIX PATTERNS, SPARSENESS, AND THE A.D.I. METHOD 
Examines the structure of the matrices that result from finite-difference 
approximations and introduces you to a newer method that offers significant advantages 
over the previous technique. 
7.12 FINITE-ELEMENT METHOD 
Takes an entirely new approach to solving elliptic partial-differential equations. It 
offers an alternative to finite differences that is often preferred, especially when 
the region is irregular. 
7.13 CHAPTER SUMMARY 
Lets you systematically review the topics of Chapter 7. 
7.14 COMPUTER PROGRAMS 
Implements the finite-difference method using both S.O.R. and A.D.I. techniques. 


7.1 EQUILIBRIUM TEMPERATURES IN A HEATED SLAB 


Many important scientific and engineering problems fall into the field of partial-differential 
equations. Here is a situation that is typical, in which the temperatures are a function of 
the coordinates of position of the point in question. 

A piece of metal is 12 in. × 3 in. × 6 ft. Three feet of the slab is kept inside a 
furnace but half of the slab protrudes (see Fig. 7.1). In order to decrease heat losses to 
the air, the protruding half is covered with a 1-in. thickness of insulation. If the furnace 
is maintained at 950°F, will all points of the metal reach a temperature of 800°F or higher, 
in spite of heat loss through the insulation? Such a question might arise in heat-treating 
the slab when the only furnace available to heat the metal is too small to contain the 
whole slab. The sketch in Fig. 7.1 will explain the problem. 

Figure 7.1 (labels: portion of slab inside furnace; sheets of insulation around the metal slab, 1 in. thick) 

You will recognize that this is a boundary-value problem: The temperature is known 
for the part of the slab inside the furnace, and something is known about the temperature 
gradients for the insulated part. The problem differs from those discussed in the previous 
chapter in that more than one independent variable is involved in specifying the coordinates 
of points within the slab. (Specifically, these variables are the x-, y-, and z-coordinates.) 

This chapter presents methods to solve this problem. These methods are generally 
referred to as finite-difference methods. However, since the 1950s an equally important 
method has been developed for solving elliptic partial-differential equations, namely, the 
finite-element method. This method is very suitable for computer coding and for handling 
very complex figures and irregularities. The finite-element method breaks the complex 
region into simple subregions such as triangles and pyramids and applies solution tech- 
niques to these simpler parts (see Vichnevetsky, 1981). 

There are well-known codes that use either or both of these methods. Among them 
are ELLPACK, which uses a variety of methods including those presented in this chapter, 
and TWODEPEP (IMSL), which uses the finite-element method for two-dimensional 
problems. However, in this chapter we consider mainly finite-difference methods; we do 
show how finite elements can be used. 

Many physical phenomena are a function of more than one independent variable and 
must be represented by a partial-differential equation, by which term we mean an equation 
involving partial derivatives. Most scientific problems have mathematical models that are 
second-order equations, with the highest order of derivative being the second. If u is a 
function of the two independent variables x and y, there are three second-order partial 
derivatives: 

$$\frac{\partial^2 u}{\partial x^2}, \qquad \frac{\partial^2 u}{\partial x\,\partial y}, \qquad \frac{\partial^2 u}{\partial y^2}.$$

(We will treat only functions for which the order of differentiation is unimportant, so 
$\partial^2 u/\partial y\,\partial x = \partial^2 u/\partial x\,\partial y$.) Depending on the values of the coefficients of the 
second-derivative terms, partial-differential equations are classified as elliptic, parabolic, or 
hyperbolic. The most important distinctions to us are the kinds of problems and the nature of 
boundary conditions that lead to one type of equation or another. Steady-state potential 
distribution problems fall in the class of elliptic equations, which we discuss first. Finding the 
equilibrium temperatures attained by the metal piece protruding from the furnace is such 
a potential distribution problem. 

Figure 7.2 


7.2 EQUATION FOR STEADY-STATE HEAT FLOW 


We derive the relationship for temperature u as a function of the two space variables x 
and y for the equilibrium temperature distribution on a flat plate (Fig. 7.2). Consider the 
element of the plate whose surface area is dx dy. We assume that heat flows only in the 
x- and y-directions and not in the perpendicular direction. If the plate is very thin, or if 
the upper and lower surfaces are both well insulated, the physical situation will agree 
with our assumption. Let t be the thickness of the plate. 

Heat flows at a rate proportional to the cross-sectional area, to the temperature gradient 
($\partial u/\partial x$ or $\partial u/\partial y$), and to the thermal conductivity k, which we will assume constant at 
all points. The flow of heat is from high to low temperature, of course, meaning opposite 
to the direction of increasing gradient. We use a minus sign in the equation to account 
for this: 

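Rate of heat flow into element at $x = x_0$, in the x-direction: 

$$-kt(dy)\frac{\partial u}{\partial x}.$$

The gradient at $x_0 + dx$ is the gradient at $x_0$ plus the increment in gradient over the 
distance dx: 

Gradient at $x_0 + dx$: 

$$\frac{\partial u}{\partial x} + \frac{\partial}{\partial x}\left(\frac{\partial u}{\partial x}\right)dx.$$

Rate of heat flow out of element at $x = x_0 + dx$: 

$$-kt(dy)\left[\frac{\partial u}{\partial x} + \frac{\partial}{\partial x}\left(\frac{\partial u}{\partial x}\right)dx\right].$$

Net rate of heat flow into element in x-direction: 

$$-kt(dy)\left[\frac{\partial u}{\partial x} - \left(\frac{\partial u}{\partial x} + \frac{\partial^2 u}{\partial x^2}dx\right)\right] = kt(dx)(dy)\frac{\partial^2 u}{\partial x^2}.$$

Similarly, in the y-direction we have the following. 

Net rate of heat flow into element in y-direction: 

$$-kt(dx)\left[\frac{\partial u}{\partial y} - \left(\frac{\partial u}{\partial y} + \frac{\partial^2 u}{\partial y^2}dy\right)\right] = kt(dx)(dy)\frac{\partial^2 u}{\partial y^2}.$$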

The total heat flowing into the elemental volume by conduction is the sum of these 
net flows in the x- and y-directions. If there were heat generated within the element, this 
would be added to the heat entering by conduction. The sum must be equal to the rate 
of heat lost by other mechanisms, or else there will be a buildup of heat within the 
element, causing its temperature to increase with time. In this chapter we will consider 
only the case where the temperatures do not change with time, the steady-state case 

If there is equilibrium as to temperature distribution (that is, steady state), the total 
rate of heat flow into the element plus heat generated must be zero. Hence, 


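$$kt(dx)(dy)\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) + Qt(dx)(dy) = 0,$$

where Q is the rate of heat generation per unit volume. (Obviously, if there is heat 
generation, there must be heat flow from the element by conduction that just balances 
it.) Q will often be a function of x and y. 

If Q = 0, we have 

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = \nabla^2 u = 0. \qquad (7.1)$$

The operator 

$$\nabla^2 = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}$$

is called the Laplacian, and Eq. (7.1) is called Laplace's equation. For three-dimensional 
heat-flow problems, we would have, analogously, 

$$\nabla^2 u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = 0. \qquad (7.2)$$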
Equation (7.2), which has been derived with reference to heat flow, applies as well 
to steady-state diffusion problems (where u is now concentration of material) and to 
electrical-potential distribution (where u is electromotive force); in fact, Laplace’s equa- 
tion holds for the steady-state distribution of the potential of any quantity where the rate 
of flow is proportional to the gradient, and where the proportionality constant does not 
vary with position or the value of u. In our examples, we will generally use terminology 
corresponding to heat flow as being more closely related to the average student's everyday 
experience. 

With Laplace's equation, we assume that no heat is being generated at points in the 
plate, as by electric heaters embedded in it, nor does any removal take place, as by 
cooling coils or other means. In the presence of such “sources” or “sinks,” we would not 
equate the net flow into the element to zero, as in Eq. (7.1), but the net flow into the 
element of volume would equal the net rate of heat removal from the element, if steady 
state applies. Assuming this removal rate to be a function of the location of the element 
in the xy-plane, f(x, y), we would have, with Q equal to the rate of heat generation per 
unit area, 


$$\nabla^2 u = -\frac{1}{k}Q(x, y) = f(x, y). \qquad (7.3)$$


This equation is called Poisson's equation. Our numerical methods of solving elliptic 
differential equations apply equally well to both Laplace's and Poisson's equations. Ana- 
lytical methods, however, find the solution of Poisson's equations or even Laplace's 
equations with complicated boundary conditions considerably more difficult. These two 
equations include most of the physical applications of elliptic partial-differential equations. 

The distinction as to elliptic, parabolic, or hyperbolic for second-order partial-dif- 
ferential equations depends on the coefficients of the second-derivative terms. We can 
write any such second-order equation (in two independent variables) as 


Depending on the value of $B^2 - 4AC$, we classify the equation as 

Elliptic, if $B^2 - 4AC < 0$, 
Parabolic, if $B^2 - 4AC = 0$, 
Hyperbolic, if $B^2 - 4AC > 0$. 

If the coefficients A, B, and C are functions of x, y, and/or u, the equation may 
change from one classification to another at various points in the domain. 

For Laplace’s and Poisson's equations, B = 0, A = C = 1, so these are always 
elliptic. We study the other types in later chapters. 
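
The classification rule can be sketched in a few lines. This is only an illustration under stated assumptions: the second-derivative terms are taken to appear as $A\,\partial^2u/\partial x^2 + B\,\partial^2u/\partial x\,\partial y + C\,\partial^2u/\partial y^2$, the coefficients are constants (or have been evaluated at one point of the domain), and the function name `classify_second_order_pde` is hypothetical.

```python
def classify_second_order_pde(A, B, C):
    """Classify by the sign of the discriminant B**2 - 4*A*C.

    A, B, C are the coefficients of u_xx, u_xy, u_yy at one point.
    """
    disc = B * B - 4.0 * A * C
    if disc < 0.0:
        return "elliptic"
    if disc == 0.0:
        return "parabolic"
    return "hyperbolic"

# Laplace's and Poisson's equations: A = C = 1, B = 0.
laplace = classify_second_order_pde(1.0, 0.0, 1.0)

# Two further coefficient triples, chosen here only to exercise the
# other two branches of the rule.
one_second_derivative = classify_second_order_pde(1.0, 0.0, 0.0)
opposite_signs = classify_second_order_pde(1.0, 0.0, -1.0)

print(laplace, one_second_derivative, opposite_signs)
```

When A, B, and C vary with position, the function would have to be called point by point, since the type can change across the domain.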

Equation (7.2) or Eq. (7.3) describes how u varies within the interior of a closed 
region. The function u is determined on the boundaries of the region by boundary 
conditions. The boundary conditions may specify the value of u at all points on the boundary, 
or some combination of the potential and the normal derivative. 


7.3 REPRESENTATION AS A DIFFERENCE EQUATION 


One scheme for solving all kinds of partial-differential equations is to replace the 
derivatives by difference quotients, converting the equation to a difference equation.* We then 
write the difference equation corresponding to each point at the intersections (nodes) of 
a gridwork that subdivides the region of interest at which the function values are unknown. 
Solving these equations simultaneously gives values for the function at each node that 
approximate the true values. We begin with the two-dimensional case. 
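
The idea of writing one difference equation per interior node and solving them together can be sketched on a very small grid. The sketch below makes several assumptions that are not taken from the text above: a square grid on the unit square, the standard equation that sets each interior value equal to the average of its four neighbors, repeated sweeps through the nodes instead of a direct solve, and the hypothetical names `solve_laplace` and `exact`. The boundary values come from u = x^2 - y^2, a function that satisfies Laplace's equation and for which the averaging equation holds exactly, so the computed interior values should agree with it to roundoff.

```python
def solve_laplace(boundary, n, sweeps=500):
    """Approximate Laplace's equation on the unit square.

    boundary(x, y) supplies the known edge values; n is the number of
    interior nodes per side. Returns the grid of values and the spacing h.
    """
    h = 1.0 / (n + 1)
    u = [[0.0] * (n + 2) for _ in range(n + 2)]
    # fill in the known boundary values
    for i in range(n + 2):
        for j in range(n + 2):
            if i in (0, n + 1) or j in (0, n + 1):
                u[i][j] = boundary(i * h, j * h)
    # sweep repeatedly: each interior node becomes the average of its
    # four neighbors (newest values are used as soon as available)
    for _ in range(sweeps):
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1])
    return u, h

def exact(x, y):
    return x * x - y * y

u, h = solve_laplace(exact, n=3)
max_err = max(abs(u[i][j] - exact(i * h, j * h))
              for i in range(1, 4) for j in range(1, 4))
print("largest interior error:", max_err)
```

For boundary data that do not come from such a simple function, the node values would only approximate the true solution, with an error that shrinks as the grid is refined.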

Approximating derivatives by difference quotients was the subject matter of Chapter 
4. They may also be derived by the method of undetermined coefficients (Appendix B). 
We rederive the few relations we need independently. 

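Let $h = \Delta x$ be the equal spacing of the gridwork in the x-direction (see Fig. 7.3). We
assume that the function f(x) has a continuous fourth derivative. By Taylor series,

\[ f(x_n + h) = f(x_n) + f'(x_n)h + \frac{f''(x_n)}{2}h^2 + \frac{f'''(x_n)}{6}h^3 + \frac{f^{iv}(\xi_1)}{24}h^4, \qquad x_n < \xi_1 < x_n + h, \]

\[ f(x_n - h) = f(x_n) - f'(x_n)h + \frac{f''(x_n)}{2}h^2 - \frac{f'''(x_n)}{6}h^3 + \frac{f^{iv}(\xi_2)}{24}h^4, \qquad x_n - h < \xi_2 < x_n. \]

It follows that

\[ \frac{f(x_n + h) - 2f(x_n) + f(x_n - h)}{h^2} = f''(x_n) + \frac{f^{iv}(\xi)}{12}h^2, \qquad x_n - h < \xi < x_n + h. \]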


*This is exactly the same as the method of solving a boundary-value problem through replacing it by difference 
equations, as discussed in Chapter 6.

[Figure 7.3]

A subscript notation is convenient:

\[ \frac{f_{n+1} - 2f_n + f_{n-1}}{h^2} = f''_n + O(h^2). \tag{7.4} \]

In Eq. (7.4) the subscripts on f indicate the x-values at which it is evaluated. The
order relation $O(h^2)$ signifies that the error approaches proportionality to $h^2$ as $h \to 0$.
Similarly, the first derivative is approximated,

\[ \frac{f(x_n + h) - f(x_n - h)}{2h} = f'(x_n) + \frac{f'''(\xi)}{6}h^2, \qquad x_n - h < \xi < x_n + h, \]

\[ \frac{f_{n+1} - f_{n-1}}{2h} = f'_n + O(h^2). \tag{7.5} \]


(The first derivative could also be approximated by the forward or backward difference, 
but this would have an error of O(h). We prefer the more accurate central-difference 
approximation.) 
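
The difference in accuracy between these quotients can be checked numerically. The sketch below uses $f(x) = \sin x$ at $x = 1$ as an arbitrary test case (the function names and the test point are assumptions for illustration). Halving h should roughly halve the forward-difference error and quarter the central-difference and second-difference errors.

```python
import math


def forward_diff(f, x, h):
    """Forward difference for f'(x); error is O(h)."""
    return (f(x + h) - f(x)) / h


def central_diff(f, x, h):
    """Central difference for f'(x), as in Eq. (7.5); error is O(h^2)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def second_diff(f, x, h):
    """Central difference for f''(x), as in Eq. (7.4); error is O(h^2)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


x0 = 1.0
for h in (0.1, 0.05, 0.025):
    e_fwd = abs(forward_diff(math.sin, x0, h) - math.cos(x0))
    e_cen = abs(central_diff(math.sin, x0, h) - math.cos(x0))
    e_sec = abs(second_diff(math.sin, x0, h) + math.sin(x0))
    print(f"h={h:<6} forward={e_fwd:.2e}  central={e_cen:.2e}  second={e_sec:.2e}")
```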

When u is a function of both x and y, we get the second partial derivative with respect
to x, $\partial^2 u/\partial x^2$, by holding y constant and evaluating the function at three points where x
equals $x_n$, $x_n + h$, and $x_n - h$. The partial derivative $\partial^2 u/\partial y^2$ is similarly computed,
holding x constant. We require that fourth derivatives with respect to both variables exist.

To solve the Laplace equation on a region in the xy-plane, we subdivide the region
with equispaced lines parallel to the x- and y-axes. Consider a portion of the region near
$(x_i, y_j)$. (See Fig. 7.3.) We wish to approximate

\[ \nabla^2 u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0. \]

Replacing the derivatives by difference quotients that approximate the derivatives at the
point $(x_i, y_j)$, we get

\[ \nabla^2 u(x_i, y_j) = \frac{u(x_{i+1}, y_j) - 2u(x_i, y_j) + u(x_{i-1}, y_j)}{(\Delta x)^2} + \frac{u(x_i, y_{j+1}) - 2u(x_i, y_j) + u(x_i, y_{j-1})}{(\Delta y)^2} = 0. \]

It is convenient to let double subscripts on u indicate the x- and y-values:

\[ \nabla^2 u_{i,j} = \frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{(\Delta x)^2} + \frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{(\Delta y)^2} = 0. \]

It is common to take $\Delta x = \Delta y = h$, resulting in considerable simplification, so that

\[ \nabla^2 u_{i,j} = \frac{1}{h^2}\left(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j}\right) = 0. \tag{7.6a} \]

Note that five points are involved in the relationship of Eq. (7.6a), points to the
right, left, above, and below the central point $(x_i, y_j)$. It is convenient to represent the
relationship pictorially, where the linear combination of u's is represented symbolically.
Equation (7.6a) becomes

\[ \nabla^2 u_{i,j} = \frac{1}{h^2}\begin{Bmatrix} & 1 & \\ 1 & -4 & 1 \\ & 1 & \end{Bmatrix} u_{i,j} = 0. \tag{7.6b} \]


The representation of the Laplacian operator by the pictorial operator that represents
the linear combination of u-values is fundamental to the finite-difference method for
elliptic partial-differential equations. The approximation has $O(h^2)$ error, provided that u
is sufficiently smooth. This formula is referred to as the five-point or five-point star
formula.
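
As a quick check of the five-point formula, it can be applied to a function known to satisfy Laplace's equation. The sketch below uses $u = e^x \sin y$, chosen here only as a convenient harmonic test function; the function names are illustrative. Because the exact Laplacian is zero, whatever the formula returns is pure truncation error, and it should shrink by about a factor of 4 each time h is halved.

```python
import math


def five_point_laplacian(u, x, y, h):
    """Five-point approximation of the Laplacian of u at (x, y), Eq. (7.6a)."""
    return (u(x + h, y) + u(x - h, y) + u(x, y + h) + u(x, y - h)
            - 4.0 * u(x, y)) / (h * h)


def u_harmonic(x, y):
    """A harmonic test function: its exact Laplacian is zero everywhere."""
    return math.exp(x) * math.sin(y)


for h in (0.2, 0.1, 0.05):
    err = abs(five_point_laplacian(u_harmonic, 0.5, 0.5, h))
    print(f"h={h:<5} truncation error={err:.3e}")
```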

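In addition to the five-point formula just given, one can derive the nine-point formula
for Laplace's equation by similar methods to get

\[ \nabla^2 u_{i,j} = \frac{1}{6h^2}\begin{Bmatrix} 1 & 4 & 1 \\ 4 & -20 & 4 \\ 1 & 4 & 1 \end{Bmatrix} u_{i,j} = 0. \tag{7.6c} \]

In this case the approximation has $O(h^6)$ error, provided that u is sufficiently smooth.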

We can solve Eqs. (7.9a) by elimination. It turns out that the largest element always
occurs on the diagonal, so pivoting isn't required. The solution vector is

    v_1 =  0.3530,    v_8  =  0.4988,    v_15 =  0.3530,
    v_2 =  0.9132,    v_9  =  1.2894,    v_16 =  0.9132,
    v_3 =  2.0103,    v_10 =  2.8323,    v_17 =  2.0103,
    v_4 =  4.2957,    v_11 =  6.0193,    v_18 =  4.2957,
    v_5 =  9.1531,    v_12 = 12.6537,    v_19 =  9.1531,
    v_6 = 19.6631,    v_13 = 26.2893,    v_20 = 19.6631,
    v_7 = 43.2101,    v_14 = 53.1774,    v_21 = 43.2101.

Symmetry in the system is apparent by inspection, and we could have reduced the
number of equations by taking this into account. The regular pattern of coefficients in
the matrix of Eqs. (7.9a) would have been lost, however.

We now are able to examine the effect of grid size h. Three points are identical in 
the two sets of equations. We show these results side by side in the first two columns of 
Table 7.1. 


7.4; LAPLACE’S EQUATION ON A RECTANGULAR REGION 483 


Table 7.1

Point (Fig. 7.5) | From Eq. (7.8), with h = 5 cm | From Eq. (7.9a), with h = 2.5 cm | From Eq. (7.9b), with h = 2.5 cm | Analytical value

The answers with h = 2.5 cm show significant improvement, as expected. In fact,
since the error in approximating the derivatives by central-difference quotients is of $O(h^2)$,
we might anticipate that the errors in the solution through sets of difference equations
would also vary as $h^2$; this is roughly true. When we have such knowledge of the effect
of h, an extrapolation can be made. When this is done, temperatures within 3% of the
analytical values are obtained.
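
The extrapolation can be sketched as follows, assuming the error is proportional to $h^2$, so that halving h divides it by 4. The h = 5 cm values are the converged solution of the three-equation system reported in Section 7.5. Pairing them with $v_9$, $v_{11}$, and $v_{13}$ of the h = 2.5 cm solution is an inference from the grid layout, not something the text states explicitly, and the function name `richardson` is illustrative.

```python
def richardson(u_h, u_half, order=2):
    """Extrapolate from values at spacings h and h/2, assuming error ~ h**order."""
    factor = 2 ** order - 1           # 3 when the error varies as h^2
    return u_half + (u_half - u_h) / factor


coarse = [1.786, 7.143, 26.786]       # h = 5 cm
fine = [1.2894, 6.0193, 26.2893]      # h = 2.5 cm (assumed to be v9, v11, v13)

for u_h, u_half in zip(coarse, fine):
    print(f"{u_h:8.3f} {u_half:9.4f} -> extrapolated {richardson(u_h, u_half):8.4f}")
```

The extrapolated value moves further in the direction in which the refinement already moved the solution, by one third of the observed change.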

Another way to reduce the error is to use the nine-point formula for the Laplacian, 
Eq. (7.6c). Doing so gives the matrix equation of Eq. (7.9b); here we assumed values 
for the corner points equal to the average of adjacent boundary points. The solution to 
this gives much better accuracy. Table 7.1 shows three of these values. 

\[
\begin{bmatrix} T & S & 0 \\ S & T & S \\ 0 & S & T \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_{21} \end{bmatrix}
= \mathbf{b}, \tag{7.9b}
\]

where T is the 7 × 7 tridiagonal block with −20 on the diagonal and 4 on the adjacent
diagonals, S is the 7 × 7 tridiagonal block with 4 on the diagonal and 1 on the adjacent
diagonals, and the right-hand side b is zero except for $b_7 = -550$, $b_{14} = -600$, and
$b_{21} = -550$.




In principle, then, we can use the numerical method to solve the steady-state heat-flow
equation to any desired accuracy; all we need do is continue to make h smaller.
In practice, however, this procedure runs into severe difficulties. It is readily apparent
that the number of equations increases inordinately fast. With h = 1.25, we would have 105
discrete interior points; with h = 0.625 we have 465, and so on. Storing a matrix with
105 rows and 105 columns would require about 11,000 words (44,000 bytes) of
computer memory; with 465 equations we would need about 216,000 words (865,000 bytes).
Few computer systems allow us such a generous partition, and overlaying memory space
from disk storage would be extremely time-consuming. There must be a better way, you
are thinking, and there is. Along with memory requirements, we worry about execution
times.

The remedy to the problem of overly large matrices is to take advantage of the sparseness
exhibited by Eqs. (7.9a). That system of 21 equations has nonzero terms in only 85 out
of 441 elements, or 19%. The larger matrices that result when h is made smaller have an even
smaller proportion of nonzero terms (4% and 1% with 105 and 465 equations, respectively).
Unfortunately, the systems are not tridiagonal. (If they were, we could work with
only the nonzero terms, as we found in Chapter 2.) With banded matrices such as we
have here, none of the zero elements outside the bands changes, but elements between
the side bands do fill up with nonzeros. The preferred method when a large number of points
is in our mesh is iteration, which we discuss next.
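
The sparsity figures can be reproduced by counting. In the five-point coefficient matrix, each interior point contributes one diagonal entry and each pair of adjacent interior points contributes two off-diagonal entries. The sketch below assumes interior grids of 3 × 7, 7 × 15, and 15 × 31 points, which is an inference from the 21, 105, and 465 interior points quoted in the text; the function name is illustrative.

```python
def five_point_nonzeros(nx, ny):
    """Count nonzeros in the five-point matrix for an nx-by-ny interior grid."""
    n = nx * ny                               # number of equations
    horizontal_pairs = (nx - 1) * ny          # neighbors along one direction
    vertical_pairs = nx * (ny - 1)            # neighbors along the other
    nnz = n + 2 * (horizontal_pairs + vertical_pairs)
    return n, nnz, nnz / (n * n)


for nx, ny in ((3, 7), (7, 15), (15, 31)):
    n, nnz, frac = five_point_nonzeros(nx, ny)
    print(f"{n:4d} equations: {nnz:5d} nonzeros of {n * n:6d} ({100 * frac:.1f}%)")
```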


7.5 ITERATIVE METHODS FOR LAPLACE'S EQUATION


When the steady-state heat-flow equation in two dimensions,

\[ \nabla^2 u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0, \]

is approximated at a set of grid points $(x_i, y_j)$, a set of simultaneous linear equations results, as Eqs. (7.8)
and (7.9) illustrate. Proper ordering of the equations gives a diagonally dominant system.
(Diagonal dominance means that the magnitude of the coefficient on the diagonal is not
less than the sum of the magnitudes of the other coefficients in the same row.) As we
saw in Chapter 2, such systems can be solved by Gauss-Seidel iteration.

When applied to Laplace's equation, the iterative technique is called Liebmann's
method. We illustrate the procedure with the same problem as in the last section. The
difference equations, when h = 5 cm, are

\[ \begin{aligned} -4u_1 + u_2 &= 0, \\ u_1 - 4u_2 + u_3 &= 0, \\ u_2 - 4u_3 &= -100. \end{aligned} \tag{7.10} \]


The first step in applying Liebmann’s method is to rearrange these equations into a new 
form by solving each equation for the variable on the diagonal: 


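\[ u_1 = \frac{u_2}{4}, \qquad u_2 = \frac{u_1 + u_3}{4}, \qquad u_3 = \frac{u_2 + 100}{4}. \tag{7.11} \]

Observe that the pictorial operator can give this rearrangement directly:

\[ \begin{Bmatrix} & 1 & \\ 1 & -4 & 1 \\ & 1 & \end{Bmatrix} u_{i,j} = 0 \;\rightarrow\; u_{i,j} = \frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}}{4}. \tag{7.12} \]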


We solve Eqs. (7.10) by beginning with some initial approximation $\bar{u}$ to u, the
solution vector, substituting values from this approximation into the right-hand sides of
Eqs. (7.11), and generating new components of $\bar{u}$ that converge to the solution.
(Convergence is usually faster if each new component of $\bar{u}$ is used as soon as it is available
rather than only after a complete sweep. Waiting for the complete sweep is called the Jacobi
method; using each new value immediately is called Gauss-Seidel.)

The better the beginning approximation, the more rapid the convergence, but, because
the system is diagonally dominant, even a poor initial vector will eventually converge.
Inspection of the problem underlying the equations lets us guess at reasonably good
starting values: we know that $u_1$ will be small, that $u_2$ will be about four times as large
as $u_1$, and that $u_3$ will be a little more than one fourth of 100. Suppose we begin with
$u_3 = 30$, $u_2 = 7.5$, $u_1 = 2$. Performing the iterations gives the values shown in Table
7.2.
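
The iteration on Eqs. (7.11) can be sketched in a few lines. The code below is an illustrative implementation of the Gauss-Seidel form (each new value is used immediately), started from the guess $u_1 = 2$, $u_2 = 7.5$, $u_3 = 30$ and stopped when the largest change falls below 0.001. The function names and the stopping tolerance are assumptions made for the sketch.

```python
def liebmann_sweep(u):
    """One Gauss-Seidel sweep of Eqs. (7.11); new values are used at once."""
    u1, u2, u3 = u
    u1 = u2 / 4.0
    u2 = (u1 + u3) / 4.0
    u3 = (u2 + 100.0) / 4.0
    return [u1, u2, u3]


def liebmann(u0, tol=1e-3, max_iter=100):
    """Iterate until the largest change in any component is below tol."""
    history = [list(u0)]
    u = list(u0)
    for _ in range(max_iter):
        new = liebmann_sweep(u)
        change = max(abs(a - b) for a, b in zip(new, u))
        u = new
        history.append(u)
        if change < tol:
            break
    return history


for row in liebmann([2.0, 7.5, 30.0]):
    print("  ".join(f"{value:7.3f}" for value in row))
```

The first sweep gives 1.875, 7.969, and 26.992, and the iterates settle at 1.786, 7.143, and 26.786.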

The iterations converge as anticipated and to the same values as in the previous
section when we solved the equations by elimination. We terminate the iterations when
all components of $\bar{u}$ converge to constant values. For only three equations, iteration is
surely more computational effort than elimination, but for large systems there are important
advantages.

The most obvious advantage is a greatly reduced demand for computer memory
space. In our example problem, if h is 1.25 cm, there are 105 interior points at grid
intersections. To store the full 105 × 105 matrix, we need 44,000 bytes, not even taking
into account the right-hand sides and the solution vector. With Liebmann's method, we
need to store only the 105 values of $u_{i,j}$ plus 44 boundary values; this requires about 600
bytes. The difference in storage requirements is even more striking when h = 0.625 cm.
The 865,000 bytes for the full matrix is reduced to a very moderate 2300 bytes.

Table 7.2

                      u_1       u_2       u_3
    Initial values    2         7.5       30
                      1.875     7.969     26.992
                      1.992     7.246     26.812
                      1.812     7.156     26.789
                      1.789     7.144     26.786
                      1.786     7.143     26.786
                      1.786     7.143     26.786


With large systems there is also a very significant reduction in execution time,
although the magnitude of this depends on how good a starting vector we can
supply, and very strongly on the precision we require. It is customary to terminate the
iterations when the largest change in any component of $\bar{u}$ meets a supplied tolerance
value. This tolerance should be chosen with consideration for the importance of precision
in the final answers (many engineering applications do not need extreme accuracy) and 
only after determining whether the parameters of the problem (boundary conditions, 
dimensions, material properties, and so on) are themselves known with great accuracy. 

Since the speed of convergence depends on the initial value of $\bar{u}$, it is frequently
worthwhile to spend some effort in obtaining a good initial estimate. Some guidelines
can be helpful. It is obvious, since the temperature at each point is the average of
temperatures at surrounding points, that no interior point can have a temperature greater
than the hottest boundary point, or a temperature lower than the coldest. Sometimes the
simple expedient of setting all interior points equal to the mean of all the boundary 
temperatures is used to start iterations. Interior points near cold edges have low temper- 
atures and those near hot edges have higher temperatures. This may let us make good 
guesses from which to begin iterations. A more refined technique solves the problem 
with a coarse mesh, getting a quick solution; these values then are used as a basis to fill 
in initial values at intermediate points. 

When our sample problem was solved with h = 2.5 (21 interior points), employing
the average of boundary values as the starting temperatures, these results were obtained
after 21 iterations. (The iterations were stopped when the largest change in any component
of $\bar{u}$ was less than 0.001.)

    u_1 =  0.354,    u_8  =  0.500,    u_15 =  0.354,
    u_2 =  0.914,    u_9  =  1.291,    u_16 =  0.914,
    u_3 =  2.01…,    u_10 =  2.83…,    u_17 =  2.01…,
    u_4 =  4.297,    u_11 =  6.021,    u_18 =  4.296,
    u_5 =  9.154,    u_12 = 12.655,    u_19 =  9.154,
    u_6 = 19.664,    u_13 = 26.290,    u_20 = 19.663,
    u_7 = 43.210,    u_14 = 53.178,    u_21 = 43.210.

Comparison with the values in Section 7.4 shows that the same results are obtained
as when the equations are solved by elimination.

Liebmann’s method essentially takes care of the problem of memory requirement for 
the steady-state heat-flow problem (though three-dimensional problems can still be 
demanding). The chief drawback in Liebmann’s method is slow convergence, which is 
especially acute when there are a larger number of points, because then each iteration is 
lengthy and, in addition, more iterations are required to reach a given tolerance. 

The relaxation method of Southwell, as discussed in Section 2.11, would be a way
of attaining faster convergence in the iterative method. (In fact, Southwell developed his
method to solve potential problems.) As pointed out in the discussion of that section,
relaxation is not adapted to computer solution of sets of equations. Based on Southwell's
technique, the use of an overrelaxation factor can give significantly faster convergence,
however. Since we handle each equation in a standard and repetitive order, this method
is called successive overrelaxation, frequently abbreviated as S.O.R.

To show how successive overrelaxation can be applied to Laplace's equation, we
begin with Eq. (7.12), adding superscripts to show that a new value is computed from
previous iterates,

\[ u_{i,j}^{(k+1)} = \frac{u_{i+1,j}^{(k)} + u_{i-1,j}^{(k+1)} + u_{i,j+1}^{(k)} + u_{i,j-1}^{(k+1)}}{4}. \]

We now both add and subtract $u_{i,j}^{(k)}$ on the right-hand side, getting

\[ u_{i,j}^{(k+1)} = u_{i,j}^{(k)} + \left[\frac{u_{i+1,j}^{(k)} + u_{i-1,j}^{(k+1)} + u_{i,j+1}^{(k)} + u_{i,j-1}^{(k+1)} - 4u_{i,j}^{(k)}}{4}\right]. \tag{7.13} \]

(The numerator term will be zero when final values, after convergence, are used. The
term in brackets is what Southwell called the "residual," which he "relaxed" to zero by
changes in the temperature at $(x_i, y_j)$.)

We can consider the bracketed term in Eq. (7.13) to be an adjustment to the old
value $u_{i,j}^{(k)}$, to give the new and improved value $u_{i,j}^{(k+1)}$. If, instead of adding just the
bracketed term, we add a larger value (thus "overrelaxing"), we get faster convergence.
We modify Eq. (7.13) by including an overrelaxation factor $\omega$ to get the new iterating
relation

\[ u_{i,j}^{(k+1)} = u_{i,j}^{(k)} + \omega\left[\frac{u_{i+1,j}^{(k)} + u_{i-1,j}^{(k+1)} + u_{i,j+1}^{(k)} + u_{i,j-1}^{(k+1)} - 4u_{i,j}^{(k)}}{4}\right]. \tag{7.14} \]

Maximum acceleration is obtained for some optimum value of $\omega$. This optimum value
will always lie between 1.0 and 2.0.

Table 7.3 shows some results that demonstrate how S.O.R. can speed up the con- 
vergence in our example problem for h = 2.5 (21 interior points) and for h = 1.25 (105 
interior points.) 

Table 7.3  Effect* of overrelaxation factor on speed of convergence

    h = 2.5 cm (21 interior points)      h = 1.25 cm (105 interior points)
    ω        Number of iterations        ω        Number of iterations
    1.00     20                          1.00     70
    1.10     15                          1.10     58
    1.20     13                          1.20     46
    1.30     12                          1.30     35
    1.40     15                          1.40     29
    1.50     18                          1.50     26
    1.60     23                          1.60     28
                                         1.70     36
    ω_opt at about 1.3                   ω_opt at about 1.5

*The example problem was solved using S.O.R.; iterations were continued until the maximum
change in any component of $\bar{u}$ was less than 0.001.
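
An experiment of this kind can be sketched as follows. The code applies Eq. (7.14) on a grid of nx × ny interior points, with one edge held at 100 and the other three at 0. That boundary arrangement is inferred from the right-hand sides of the equations in Section 7.4, and the zero starting values, sweep order, and function name are assumptions of the sketch, so iteration counts need not match Table 7.3 exactly.

```python
def sor_laplace(nx, ny, omega=1.0, hot=100.0, tol=1e-3, start=0.0, max_iter=10000):
    """S.O.R. for Laplace's equation on an nx-by-ny interior grid, Eq. (7.14).

    Assumed boundary values: the edge j = ny + 1 is held at `hot`, others at 0.
    Returns the grid (including boundary rows) and the number of sweeps used.
    """
    u = [[start] * (ny + 2) for _ in range(nx + 2)]
    for i in range(nx + 2):
        u[i][0] = 0.0
        u[i][ny + 1] = hot
    for j in range(ny + 2):
        u[0][j] = 0.0
        u[nx + 1][j] = 0.0
    for sweep in range(1, max_iter + 1):
        biggest = 0.0
        for i in range(1, nx + 1):
            for j in range(1, ny + 1):
                residual = (u[i + 1][j] + u[i - 1][j] + u[i][j + 1]
                            + u[i][j - 1] - 4.0 * u[i][j])
                delta = omega * residual / 4.0     # overrelaxed correction
                u[i][j] += delta
                biggest = max(biggest, abs(delta))
        if biggest < tol:
            return u, sweep
    raise RuntimeError("S.O.R. did not converge")


for omega in (1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6):
    _, sweeps = sor_laplace(3, 7, omega=omega)
    print(f"omega = {omega:.2f}: {sweeps} iterations")
```

With `omega=1.0` the routine is Liebmann's method. Run with a tight tolerance, the values next to the hot edge agree with $v_7 = 43.2101$ and $v_{14} = 53.1774$ from the elimination solution.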


The optimum value for $\omega$ is not always predictable in advance. There are methods
of using the results of the first few iterations to find a value of $\omega$ that is near optimum.*
We do not discuss these here. For a rectangular region having constant boundary conditions
(Dirichlet conditions), a reasonable estimate of the optimum $\omega$ can be determined as the
smaller root of the quadratic equation

\[ \left(\cos\frac{\pi}{p} + \cos\frac{\pi}{q}\right)^2 \omega^2 - 16\omega + 16 = 0, \tag{7.15} \]

where p and q are the number of mesh divisions on each side of the rectangular region.
Solving the above equation for $\omega$ gives

\[ \omega_{opt} = \frac{4}{2 + \sqrt{4 - c^2}}, \]

with

\[ c = \cos\frac{\pi}{p} + \cos\frac{\pi}{q}. \]

For our example problem, Eq. (7.15) predicts $\omega_{opt} = 1.267$ for h = 2.5 and $\omega_{opt} =
1.532$ for h = 1.25. We see from Table 7.3 that there is good agreement with actual
results.
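
The estimate from Eq. (7.15) is simple to evaluate. The sketch below computes the smaller root in the closed form $4/(2 + \sqrt{4 - c^2})$; the function name `omega_opt` is illustrative. The example problem has p = 4, q = 8 mesh divisions for h = 2.5 and p = 8, q = 16 for h = 1.25.

```python
import math


def omega_opt(p, q):
    """Smaller root of Eq. (7.15) for p and q mesh divisions per side."""
    c = math.cos(math.pi / p) + math.cos(math.pi / q)
    return 4.0 / (2.0 + math.sqrt(4.0 - c * c))


print(f"h = 2.5 : {omega_opt(4, 8):.3f}")     # 1.267
print(f"h = 1.25: {omega_opt(8, 16):.3f}")    # 1.532
for n in (2, 3, 5, 10, 20, 100):              # square regions, p = q = n
    print(f"p = q = {n:3d}: {omega_opt(n, n):.3f}")
```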

To show the range of values for $\omega_{opt}$ that are customary in steady-state problems,
we show in Table 7.4 solutions to Eq. (7.15) for several values of p and q where p = q
(a square region).

*Hageman and Young (1981) present details. We also consider S.O.R. and the optimum $\omega$ as an eigenvalue
problem in Section 7.7.


Table 7.4

    p = q      Value of ω_opt
    2          1.000
    3          1.072
    5          1.260
    10         1.528
    20         1.729
    100        1.939
    ∞          2.000


We have used a somewhat intuitive approach to S.O.R. and values for $\omega$,
the overrelaxation factor. Actually this is an eigenvalue problem, as you will see in
Section 7.7.


7.6 THE POISSON EQUATION 


The methods of the previous section are readily applied to Poisson's equation. We illustrate
with an analysis of torsion in a rectangular bar subject to twisting. The torsion function
$\phi$ satisfies the Poisson equation:

\[ \nabla^2\phi + 2 = 0, \qquad \phi = 0 \text{ on boundary.*} \tag{7.16} \]

The tangential stresses are proportional to the partial derivatives of $\phi$ for a twisted
prismatic bar of constant cross section. Let us find $\phi$ over the cross section of a rectangular
bar 6 × 8 in in size.

Subdivide the cross section into 2-in squares, so that there are six interior points, as
in Fig. 7.7. In terms of difference quotients, Eq. (7.16) becomes

\[ \frac{1}{h^2}\begin{Bmatrix} & 1 & \\ 1 & -4 & 1 \\ & 1 & \end{Bmatrix}\phi_{i,j} + 2 = 0, \]

or

\[ \begin{Bmatrix} & 1 & \\ 1 & -4 & 1 \\ & 1 & \end{Bmatrix}\phi_{i,j} + 8 = 0. \tag{7.17} \]

(The function $\phi$ is dimensional; with our choice of h, $\phi$ will have square inches as units.)

*The second term in Eq. (7.16), here equal to 2, actually should be multiplied by the angular twist per unit
length and by the modulus of rigidity for the material.

[Figure 7.7]


The set of equations, when (7.17) is applied at each interior point, is

\[ \begin{aligned}
0 + 0 + \phi_{12} + \phi_{21} - 4\phi_{11} + 8 &= 0, \\
\phi_{11} + 0 + 0 + \phi_{22} - 4\phi_{12} + 8 &= 0, \\
0 + \phi_{11} + \phi_{22} + \phi_{31} - 4\phi_{21} + 8 &= 0, \\
\phi_{21} + \phi_{12} + 0 + \phi_{32} - 4\phi_{22} + 8 &= 0, \\
0 + \phi_{21} + \phi_{32} + 0 - 4\phi_{31} + 8 &= 0, \\
\phi_{31} + \phi_{22} + 0 + 0 - 4\phi_{32} + 8 &= 0.
\end{aligned} \tag{7.18} \]

Symmetry considerations show that $\phi_{11} = \phi_{12} = \phi_{31} = \phi_{32}$ and $\phi_{21} = \phi_{22}$, so
only two unknowns are left in (7.18) after substitutions:

\[ \phi_{21} - 3\phi_{11} + 8 = 0, \qquad 2\phi_{11} - 3\phi_{21} + 8 = 0. \]

These are readily solved by elimination: $\phi_{11} = 32/7 \approx 4.57$, $\phi_{21} = 40/7 \approx 5.71$.

Now let us solve the problem with a 1-in-square mesh using iteration. To use Liebmann's
method, we need initial estimates of $\phi$. As shown in Fig. 7.8(a), we estimate
these, using our previous results as a guide. We converge in 25 iterations (tolerance =
0.001) to the values shown in Fig. 7.8(b), employing Eq. (7.19):*

\[ \phi_{i,j} = \frac{1}{4}\left(\phi_{i+1,j} + \phi_{i-1,j} + \phi_{i,j+1} + \phi_{i,j-1} + 2\right). \tag{7.19} \]

*Equation (7.19) differs from Eq. (7.12), the Liebmann method for Laplace's equation, only in that the
f(x, y) term of Poisson's equation must be included.

(a) Initial values of φ for iteration

    1.4    2.6    3.0    2.6    1.4
    2.5    4.6    5.0    4.6    2.5
    3.0    5.2    5.5    5.2    3.0
    3.5    5.7    6.0    5.7    3.5
    3.0    5.2    5.5    5.2    3.0
    2.5    4.6    5.0    4.6    2.5
    1.4    2.6    3.0    2.6    1.4

(b) Final values for φ by iterative methods

    2.042    3.047    3.353    3.047    2.043
    3.123    4.794    5.319    4.794    3.123
    3.657    5.686    6.335    5.686    3.657
    3.818    5.960    6.647    5.960    3.818
    3.657    5.686    6.335    5.686    3.657
    3.124    4.794    5.319    4.794    3.124
    2.043    3.048    3.354    3.048    2.043

Figure 7.8 Torsion in a rectangular bar, mesh size of 1 in


As we have seen before, applying an overrelaxation factor should speed up the
convergence. Accordingly, we rewrite Eq. (7.19) with an overrelaxation factor:

\[ \phi_{i,j}^{(k+1)} = \phi_{i,j}^{(k)} + \omega\left[\frac{\phi_{i+1,j}^{(k)} + \phi_{i-1,j}^{(k+1)} + \phi_{i,j+1}^{(k)} + \phi_{i,j-1}^{(k+1)} - 4\phi_{i,j}^{(k)} + 2}{4}\right]. \tag{7.20} \]

The optimum overrelaxation factor calculated from Eq. (7.15) is 1.383. Using this in
Eq. (7.20), we converge in 13 iterations to the same set of values as before and as
tabulated in Fig. 7.8(b). Using S.O.R., rather than the standard Liebmann's method, cuts
down the number of iterations by nearly 50%.
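
The torsion calculation can be sketched with the same sweep used for Laplace's equation, adding the source term $2h^2$ to the residual. The code below is an illustrative implementation that starts from zero rather than from the estimates of Fig. 7.8(a), so its iteration counts need not equal the 25 and 13 quoted above. The function name and argument layout are assumptions.

```python
def torsion_sor(nx, ny, h, omega=1.0, tol=1e-3, max_iter=10000):
    """S.O.R. for del^2(phi) + 2 = 0 with phi = 0 on the boundary.

    nx, ny: numbers of interior points per direction; h: mesh size.
    Returns the grid (boundary rows included) and the number of sweeps.
    """
    phi = [[0.0] * (ny + 2) for _ in range(nx + 2)]
    source = 2.0 * h * h                       # the "+ 2" term scaled by h^2
    for sweep in range(1, max_iter + 1):
        biggest = 0.0
        for i in range(1, nx + 1):
            for j in range(1, ny + 1):
                residual = (phi[i + 1][j] + phi[i - 1][j] + phi[i][j + 1]
                            + phi[i][j - 1] - 4.0 * phi[i][j] + source)
                delta = omega * residual / 4.0
                phi[i][j] += delta
                biggest = max(biggest, abs(delta))
        if biggest < tol:
            return phi, sweep
    raise RuntimeError("iteration did not converge")


# 6 x 8 in bar with a 1-in mesh: 5 x 7 interior points.
_, n_liebmann = torsion_sor(5, 7, 1.0, omega=1.0)
phi, n_sor = torsion_sor(5, 7, 1.0, omega=1.383)
print("Liebmann sweeps:", n_liebmann, " S.O.R. sweeps:", n_sor)
print("center value:", round(phi[3][4], 3), " corner value:", round(phi[1][1], 3))
```

Calling `torsion_sor(2, 3, 2.0, tol=1e-12)` solves the six-point system of Eqs. (7.18) and returns 32/7 at the four corner points and 40/7 at the two middle points.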

The nine-point formula for the Poisson equation, 


\[ \nabla^2\phi + f = 0, \]


provided that f is a function of x and y only. The truncation error in this case is still 
Ok). Equations (7.17) and (7.18) become 


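\[ \frac{1}{6h^2}\begin{Bmatrix} 1 & 4 & 1 \\ 4 & -20 & 4 \\ 1 & 4 & 1 \end{Bmatrix}\phi_{i,j} + 2 = 0, \]

and the set of equations becomes

\[ \begin{aligned}
-20\phi_{11} + 4\phi_{12} + 4\phi_{21} + \phi_{22} + 48 &= 0, \\
4\phi_{11} - 20\phi_{12} + \phi_{21} + 4\phi_{22} + 48 &= 0, \\
4\phi_{11} + \phi_{12} - 20\phi_{21} + 4\phi_{22} + 4\phi_{31} + \phi_{32} + 48 &= 0, \\
\phi_{11} + 4\phi_{12} + 4\phi_{21} - 20\phi_{22} + \phi_{31} + 4\phi_{32} + 48 &= 0, \\
4\phi_{21} + \phi_{22} - 20\phi_{31} + 4\phi_{32} + 48 &= 0, \\
\phi_{21} + 4\phi_{22} + 4\phi_{31} - 20\phi_{32} + 48 &= 0.
\end{aligned} \]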


7.7 EIGENVALUES AND S.O.R.


The S.O.R. method used in Sections 7.5 and 7.6 is actually a special case of the relaxation
method of Section 2.11. In fact, Eqs. (7.14) and (7.20) are just special cases of Eq.
(2.28), which we repeat here:

\[ x_i^{(k+1)} = x_i^{(k)} + \frac{\omega}{a_{ii}}\left(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} - \sum_{j=i}^{n} a_{ij}x_j^{(k)}\right), \qquad i = 1, 2, \ldots, n. \tag{7.21} \]

This equation reduces to Gauss-Seidel for $\omega = 1$. We will see in this section
that the purpose of introducing $\omega$ is to reduce the eigenvalues of the iteration matrix in
order to accelerate convergence (as explained in Section 6.7).

Let us review the argument of Section 6.7 about speed of convergence. We prefer
to work with matrices rather than the individual equations. We start with Ax = b, where
A is n × n. If we rewrite A as L + D + U, then

\[ Ax = (L + D + U)x = b, \quad\text{or}\quad (L + D)x = b - Ux. \]

We can put this into an iterative form:

\[ (L + D)x^{(k+1)} = b - Ux^{(k)}, \qquad Dx^{(k+1)} = b - Lx^{(k+1)} - Ux^{(k)}, \tag{7.22} \]

giving the Gauss-Seidel equation, Eq. (2.23). In using this, we begin with an initial
approximation $x^{(0)}$. We see that Eq. (7.22) can be written as

\[ x^{(k+1)} = b' - Bx^{(k)}, \qquad b' = (L + D)^{-1}b, \qquad B = (L + D)^{-1}U. \]

Suppose that $\bar{x}$ is the solution to Ax = b. Then $\bar{x}$ certainly satisfies

\[ \bar{x} = b' - B\bar{x}. \]

Subtracting this from the preceding iteration equation, we get

\[ x^{(k+1)} - \bar{x} = e^{(k+1)} = -B(x^{(k)} - \bar{x}) = -Be^{(k)}, \]

where $e^{(k)} = x^{(k)} - \bar{x}$. This means that

\[ e^{(k+1)} = -Be^{(k)} = B^2e^{(k-1)} = \cdots = (-B)^{k+1}e^{(0)}, \]

with $e^{(0)} = x^{(0)} - \bar{x}$, the error in the initial approximation to the solution. Now, assuming
that we can write $e^{(0)}$ in terms of the eigenvectors of B, where $Bv_i = \lambda_i v_i$,

\[ e^{(0)} = c_1v_1 + c_2v_2 + \cdots + c_nv_n. \]

Now

\[ B^ke^{(0)} = c_1B^kv_1 + c_2B^kv_2 + \cdots + c_nB^kv_n = c_1\lambda_1^kv_1 + c_2\lambda_2^kv_2 + \cdots + c_n\lambda_n^kv_n, \]

and we see that, if all the eigenvalues of B are less than 1 in magnitude, the iteration
will converge and that the rate of convergence is faster when the eigenvalue of greatest
magnitude is small. We can speed up the convergence if we can find a way to reduce
the eigenvalue of maximum magnitude of B.

Consider now a simple example that shows how the relaxation factor can reduce the
eigenvalues of the iteration matrix and hence accelerate convergence. Let Ax = b be

\[ \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} x = \begin{bmatrix} 6 \\ -2 \end{bmatrix} \qquad (\text{solution is } \{4 \;\; -2\}).* \]

First, let us relate Eq. (7.21) (single-equation form for overrelaxation) to Eq. (7.22) (the
matrix form for iteration). Let A = L + D + U and write

\[ \omega(b - Ax) = 0 = \omega(b - (L + D + U)x). \tag{7.23} \]

(Actually, b − (L + D + U)x is the residual in Southwell's relaxation method of Section
2.11.) Obviously, we can write

\[ Dx = Dx. \]

Add (7.23) to both sides:

\[ Dx + 0 = Dx - \omega Lx - \omega Dx - \omega Ux + \omega b. \]

*In this section, braces indicate a column vector.

Rearrange to give the iterative form:

\[ (D + \omega L)x^{(k+1)} = [(1 - \omega)D - \omega U]x^{(k)} + \omega b, \quad\text{or} \]

\[ x^{(k+1)} = (D + \omega L)^{-1}[(1 - \omega)D - \omega U]x^{(k)} + \omega(D + \omega L)^{-1}b. \tag{7.24} \]

This is the matrix formulation of Eq. (7.21) for overrelaxation.
Let us apply this to the simple problem, Ax = b, where

\[ A = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}, \quad L = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}, \quad D = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}, \quad U = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad b = \begin{bmatrix} 6 \\ -2 \end{bmatrix}. \]

In the form of Eq. (7.24), this is

\[ \underbrace{\begin{bmatrix} 2 & 0 \\ \omega & 3 \end{bmatrix}}_{(D + \omega L)} x^{(k+1)} = \underbrace{\begin{bmatrix} 2(1 - \omega) & -\omega \\ 0 & 3(1 - \omega) \end{bmatrix}}_{(D - \omega D - \omega U)} x^{(k)} + \underbrace{\begin{bmatrix} 6\omega \\ -2\omega \end{bmatrix}}_{\omega b}. \]

The inverse of (D + ωL) is not difficult to compute. Doing so and multiplying the
matrices gives

\[ x^{(k+1)} = \begin{bmatrix} 1 - \omega & -\omega/2 \\ -\omega(1 - \omega)/3 & \omega^2/6 - \omega + 1 \end{bmatrix} x^{(k)} + \begin{bmatrix} 3\omega \\ -\omega^2 - 2\omega/3 \end{bmatrix}. \tag{7.25} \]

(You should verify that, for $\omega = 1$, this reduces to the Gauss-Seidel matrix.) In fact,
this is

\[ \begin{bmatrix} 0 & -1/2 \\ 0 & 1/6 \end{bmatrix}, \quad\text{whose eigenvalues are } \lambda_1 = 0, \; \lambda_2 = \tfrac{1}{6}. \]

We wish to determine the optimum value $\omega_{opt}$ of the overrelaxation factor in the matrix
of Eq. (7.25) that reduces the largest magnitude eigenvalue of the matrix as much as
possible. To do this, recall four properties of matrices:

1. det(AB) = det(A) det(B).
2. det(A) = product of its eigenvalues.
3. trace(A) = sum of diagonal elements = sum of its eigenvalues.
4. The eigenvectors of a matrix A form a basis for $R^n$ if the matrix can be
   diagonalized. We used this earlier.

Applying these to the matrix of Eq. (7.25), we have

\[ \lambda_1\lambda_2 = (\omega - 1)^2, \]

where $0 < \omega < 2$. This follows because we want all the eigenvalues to be less than 1
in absolute value. Moreover, since we want the largest eigenvalue as small as possible,
we take

\[ \lambda_1 = \lambda_2 = \omega - 1. \]

By the trace property, it follows that

\[ 2(\omega - 1) = (1 - \omega) + \left(\frac{\omega^2}{6} - \omega + 1\right), \quad\text{or}\quad \omega^2 - 24\omega + 24 = 0. \]

This quadratic equation has one solution $\omega_{opt} = 1.0455488$, which falls in the range from
0 to 2.

To summarize what we have done: We wanted to solve Ax = b by iteration. If we
had used Gauss-Seidel, our iteration matrix would be

\[ \begin{bmatrix} 0 & -1/2 \\ 0 & 1/6 \end{bmatrix} \quad\text{with eigenvalues 0 and } \tfrac{1}{6}. \]

Starting with x = {0 0}, succeeding iterates are {3 −1.6667}, {3.8333 −1.9444},
{3.9722 −1.9907}, → {4 −2}.
However, by using Eq. (7.25) with $\omega = 1.0455488$, the iteration matrix becomes

\[ \begin{bmatrix} -0.0455 & -0.5228 \\ 0.0159 & 0.1366 \end{bmatrix} \quad\text{with eigenvalues 0.0455 and 0.0455.} \]

Starting again with {0 0}, we get for $x^{(1)}$, {3.1366 −1.7902}, for $x^{(2)}$,
{3.9296 −1.9850}, then {3.9953 −1.9991}. Reducing the largest-magnitude
eigenvalue from 1/6 to 0.0455 gives faster convergence. If the reduction in magnitude
had been greater, a more significant speeding up would have been observed.
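
The two-equation example can be checked directly. The sketch below assumes the system $2x_1 + x_2 = 6$, $x_1 + 3x_2 = -2$, which is consistent with the solution {4 −2} and with the iterates quoted above. It applies the single-equation form of Eq. (7.21) and computes the eigenvalues of the 2 × 2 iteration matrix of Eq. (7.25) from its trace and determinant. The function names are illustrative.

```python
import cmath


def sor_2x2(omega, steps, x=(0.0, 0.0)):
    """Apply Eq. (7.21) to 2*x1 + x2 = 6, x1 + 3*x2 = -2 (assumed system)."""
    x1, x2 = x
    iterates = []
    for _ in range(steps):
        x1 = x1 + omega * (6.0 - 2.0 * x1 - x2) / 2.0
        x2 = x2 + omega * (-2.0 - x1 - 3.0 * x2) / 3.0   # uses the new x1
        iterates.append((x1, x2))
    return iterates


def sor_eigenvalues(omega):
    """Eigenvalues of the iteration matrix in Eq. (7.25), via trace and det."""
    m11, m12 = 1.0 - omega, -omega / 2.0
    m21, m22 = -omega * (1.0 - omega) / 3.0, omega ** 2 / 6.0 - omega + 1.0
    trace = m11 + m22
    det = m11 * m22 - m12 * m21
    root = cmath.sqrt(trace * trace - 4.0 * det)   # complex-safe square root
    return (trace + root) / 2.0, (trace - root) / 2.0


w_opt = 12.0 - 120.0 ** 0.5        # root of w^2 - 24w + 24 = 0 in (0, 2)
for w in (1.0, w_opt):
    lam = max(abs(v) for v in sor_eigenvalues(w))
    print(f"omega = {w:.7f}, largest |eigenvalue| = {lam:.4f}")
    for x1, x2 in sor_2x2(w, 3):
        print(f"    {x1:.4f}  {x2:.4f}")
```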

You will do another example as an exercise. For larger matrices, a different approach
to finding the optimal $\omega_{opt}$ would be necessary. Our purpose here is only to interrelate
eigenvalues of the relaxation matrices to the convergence that is obtained. We discussed
one of these other methods for eigenvalues in Chapter 6, the power method. We also
described the QR method in Chapter 6; this is based on similarity transformations. The
selected references listed at the end of this chapter will give you further insight.


7.8 DERIVATIVE BOUNDARY CONDITIONS


In the previous examples, we solved for the steady-state temperatures at interior points 
of a rectangular plate, with the boundary temperatures being specified. In many problems, 
instead of knowing the boundary temperatures, we know the temperature gradient in the 
direction normal to the boundary, as for example when heat is being lost from the surface 
by radiation and conduction. 
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EXAMPLE 


Figure 7.9 


Consider a rectangular plate within which heat is also being uniformly generated at each 
point at a rate of Q cal/cm? - sec. The plate is steel and is 4 x 8 cm, 1 cm thick. k = 
0.16 cal/sec - cm? - °C/cm. For this situation, Poisson’s equation holds in the form 


Suppose that Q = 5 for our example. The top and bottom faces are perfectly insulated 
so that no heat is lost, while the upper and lower edges lose heat so that 


ou . 
ae 15°C/em 
in the outward direction, and the right and left edges are held at a constant temperature 
u = 20°C. We will find the steady-state temperatures at points in the plate. 
In Fig. 7.9 we sketch the plate, with a gridwork to give 21 interior points with h = 
1 cm. In this problem, the upper and lower edge temperatures are also unknown, increasing 
the total number of points at which u is to be determined to 35. Because of symmetry, 
the equations can be written in terms of 12 quantities; these are indicated on the diagram, 
We now write Eq. (7.26) at each unknown point, which is our general procedure in 
elliptic partial-differential equations, except that at the upper and lower edges we cannot 
form the five-point combination—there are not enough points to form the star. We get 
around this by the device of extending our network to a row of exterior points. We utilize 


e ° e ° e e Sai 
uy Uy Wy Uy Uy Uy uy 
20 +. ° ° * ° * 20 
Us Ue Ua Ug Uy Us Us 
20 e e e e e e e 20 
Ug Uo uM) Uy Wy Uo Ug 
20 e e e e e e 20 
Us Us is Ug uy Us Us 
20 e e e e e e e 20 
uy Ua uy Uy U3 ua uy 
20 ° ° * *—_* e * 20 
ou 
—=15 
e e e e e e e e ay 
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these fictitious exterior points to include in the set of equations those whose central point 


is on the upper or lower edge: 


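$$
\begin{aligned}
(20 + u_2 + u_A + u_5 - 4u_1) &= -5/0.16,\\
(u_1 + u_3 + u_B + u_6 - 4u_2) &= -5/0.16,\\
(u_2 + u_4 + u_C + u_7 - 4u_3) &= -5/0.16,\\
(u_3 + u_3 + u_D + u_8 - 4u_4) &= -5/0.16,\\
(20 + u_6 + u_1 + u_9 - 4u_5) &= -5/0.16,\\
(u_5 + u_7 + u_2 + u_{10} - 4u_6) &= -5/0.16,\\
(u_6 + u_8 + u_3 + u_{11} - 4u_7) &= -5/0.16,\\
(u_7 + u_7 + u_4 + u_{12} - 4u_8) &= -5/0.16,\\
(20 + u_{10} + u_5 + u_5 - 4u_9) &= -5/0.16,\\
(u_9 + u_{11} + u_6 + u_6 - 4u_{10}) &= -5/0.16,\\
(u_{10} + u_{12} + u_7 + u_7 - 4u_{11}) &= -5/0.16,\\
(u_{11} + u_{11} + u_8 + u_8 - 4u_{12}) &= -5/0.16.
\end{aligned}
\qquad (7.27)
$$

Here $u_A$, $u_B$, $u_C$, and $u_D$ denote the temperatures at the fictitious exterior points
adjacent to $u_1$, $u_2$, $u_3$, and $u_4$; since h = 1, the factor $1/h^2$ does not appear.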


It would appear that we have not helped ourselves by the fictitious points outside 
the plate. We have made it possible to write twelve equations, but four new unknowns 
have been introduced. However, we have not yet utilized the gradient conditions. Write 
the derivative conditions as central-difference quotients, as discussed in Chapter 4: 
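$$
\begin{aligned}
\frac{u_5 - u_A}{2h} &= 15, & u_A &= u_5 - 30,\\
\frac{u_6 - u_B}{2h} &= 15, & u_B &= u_6 - 30,\\
\frac{u_7 - u_C}{2h} &= 15, & u_C &= u_7 - 30,\\
\frac{u_8 - u_D}{2h} &= 15, & u_D &= u_8 - 30,
\end{aligned}
\qquad (7.28)
$$

where $u_A$, $u_B$, $u_C$, and $u_D$ are the fictitious exterior values adjacent to the edge points
$u_1$, $u_2$, $u_3$, and $u_4$, and h = 1.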


We write these approximations for the gradient condition, choosing the proper order 
of points so that the outward normal is negative, since heat is flowing outwardly. Using 
a central-difference approximation of error O(h?) makes these compatible with our other 
difference-quotient approximations. Each of the difference quotients allows us to write 
the fictitious temperature values in terms of points within the rectangle. 

We solve the set of equations in (7.27) after eliminating the fictitious points by using 
Eq. (7.28). Elimination, iteration (Liebmann’s method), or relaxation may be used. The 
solution of the equations is left as an exercise.
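
The procedure of this example can be sketched in a few lines of Python. This is only an illustrative sketch, not the text's own program: the function name `solve_plate` and its parameters are hypothetical, the full 5 × 7 array of unknowns is stored instead of the 12 symmetric quantities, and Liebmann's (Gauss-Seidel) iteration is used. At an edge point the fictitious exterior value is replaced by the interior neighbor minus 2h(15), as in Eq. (7.28).

```python
def solve_plate(q_over_k=5 / 0.16, edge_temp=20.0, grad=15.0, h=1.0,
                nx=7, ny=5, tol=1e-10, max_iter=100_000):
    """Liebmann iteration for the heated plate (assumed layout: rows 0 and
    ny-1 are the upper and lower edges; columns are the 7 unknown columns)."""
    u = [[edge_temp] * nx for _ in range(ny)]
    for _ in range(max_iter):
        change = 0.0
        for j in range(ny):
            for i in range(nx):
                # Left and right neighbors: held at edge_temp on the boundary.
                left = u[j][i - 1] if i > 0 else edge_temp
                right = u[j][i + 1] if i < nx - 1 else edge_temp
                # Above/below: on an edge row use the fictitious exterior
                # value, which equals the interior neighbor minus 2*h*grad.
                up = u[j - 1][i] if j > 0 else u[j + 1][i] - 2 * h * grad
                down = u[j + 1][i] if j < ny - 1 else u[j - 1][i] - 2 * h * grad
                new = (left + right + up + down + h * h * q_over_k) / 4
                change = max(change, abs(new - u[j][i]))
                u[j][i] = new
        if change < tol:
            break
    return u


if __name__ == "__main__":
    for row in solve_plate():
        print(" ".join(f"{value:8.3f}" for value in row))
```

The returned array should show the left-right and top-bottom symmetry that the text uses to reduce the problem to 12 unknowns.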
We can apply the same method to the more general boundary condition,

$$au - b\,\frac{\partial u}{\partial x} = c,$$

where a, b, and c are constants, in an obvious application of the relationships of Eq.
(7.28).


7.9 IRREGULAR REGIONS AND NONRECTANGULAR GRIDS


Figure 7.10


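When the boundary of the region is not such that the network can be drawn to have the
boundary coincide with the nodes of the mesh, we must proceed differently at points near
the boundary. Consider the general case of a group of five points whose spacing is
nonuniform, arranged in an unequal-armed star. We represent each distance by $\theta_i h$, where
$\theta_i$ is the fraction of the standard spacing h that the particular distance represents (Fig.
7.10). Along the line from $u_1$ to $u_0$ to $u_3$, we may approximate the first derivatives:

$$\left(\frac{\partial u}{\partial x}\right)_{1-0} = \frac{u_0 - u_1}{\theta_1 h}, \qquad \left(\frac{\partial u}{\partial x}\right)_{0-3} = \frac{u_3 - u_0}{\theta_3 h}.$$

Since

$$\frac{\partial^2 u}{\partial x^2} = \frac{\partial}{\partial x}\left(\frac{\partial u}{\partial x}\right),$$

we have

$$\frac{\partial^2 u}{\partial x^2} = \frac{(u_3 - u_0)/\theta_3 h - (u_0 - u_1)/\theta_1 h}{\tfrac{1}{2}(\theta_1 + \theta_3)h} = \frac{2}{h^2}\left[\frac{u_1 - u_0}{\theta_1(\theta_1 + \theta_3)} + \frac{u_3 - u_0}{\theta_3(\theta_1 + \theta_3)}\right]. \qquad (7.29)$$

Similarly,

$$\frac{\partial^2 u}{\partial y^2} = \frac{2}{h^2}\left[\frac{u_2 - u_0}{\theta_2(\theta_2 + \theta_4)} + \frac{u_4 - u_0}{\theta_4(\theta_2 + \theta_4)}\right]. \qquad (7.30)$$

The expressions in Eqs. (7.29) and (7.30) have errors O(h), which introduce larger
errors in the computations than for points that are arranged in an equal-armed star.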




Figure 7.11 
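Combining, we get

$$\nabla^2 u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = \frac{2}{h^2}\left[\frac{u_1}{\theta_1(\theta_1 + \theta_3)} + \frac{u_2}{\theta_2(\theta_2 + \theta_4)} + \frac{u_3}{\theta_3(\theta_1 + \theta_3)} + \frac{u_4}{\theta_4(\theta_2 + \theta_4)} - \left(\frac{1}{\theta_1\theta_3} + \frac{1}{\theta_2\theta_4}\right)u_0\right]. \qquad (7.31)$$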


We use the operator of Eq. (7.31) for points adjacent to boundary points when the 
boundary points do not coincide with the mesh, instead of our standard operator. If the 
boundary conditions involve normal derivatives, great complications arise, especially for 
curved boundaries. The finite-element method of Section 7.12 offers less difficulty. 
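
The unequal-arm operator of Eq. (7.31) is easy to tabulate by program. The sketch below is illustrative only: the function names `unequal_star_coefficients` and `arm_fraction` are hypothetical, and the second function assumes a circular boundary centered at the origin, with the arm measured in the +x or +y direction.

```python
import math


def unequal_star_coefficients(t1, t2, t3, t4, h):
    """Coefficients of u1, u2, u3, u4, u0 in Eq. (7.31); t1..t4 are the
    fractions theta_i of the standard spacing h."""
    scale = 2.0 / h ** 2
    c1 = scale / (t1 * (t1 + t3))
    c2 = scale / (t2 * (t2 + t4))
    c3 = scale / (t3 * (t1 + t3))
    c4 = scale / (t4 * (t2 + t4))
    c0 = -scale * (1.0 / (t1 * t3) + 1.0 / (t2 * t4))
    return c1, c2, c3, c4, c0


def arm_fraction(x, y, h, radius=1.0, direction="x"):
    """Distance from grid point (x, y) to the circle x^2 + y^2 = radius^2,
    as a fraction of h, capped at 1 (a full mesh step)."""
    if direction == "x":
        dist = math.sqrt(radius ** 2 - y ** 2) - x
    else:
        dist = math.sqrt(radius ** 2 - x ** 2) - y
    return min(dist / h, 1.0)


if __name__ == "__main__":
    # With all arms equal the standard five-point operator is recovered.
    print(unequal_star_coefficients(1, 1, 1, 1, h=1.0))
    # Short arm of a grid point at (0.8, 0.2) on a unit circle, h = 0.2.
    print(round(arm_fraction(0.8, 0.2, 0.2), 4))
```

Whatever the arm lengths, the five coefficients sum to zero, so a constant temperature field still satisfies the difference equation.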

Let us illustrate this procedure with an example. A semicircular plate of radius a has 
the temperature at the base (the straight side) held at 0° while the circumference is held 
at c°. We desire the steady-state temperatures. The analytical solution to this problem is
given by the infinite series

$$u(r, \theta) = \frac{4c}{\pi}\sum_{n=1}^{\infty}\frac{1}{2n - 1}\left(\frac{r}{a}\right)^{2n-1}\sin(2n - 1)\theta, \qquad (7.32)$$

where (r, θ) are the polar coordinates of a point on the plate. (For points near the
circumference, several hundred terms are needed in order to compute temperatures with
any degree of accuracy.) There is right-to-left symmetry in the temperatures.
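
As a rough illustration of how Eq. (7.32) might be evaluated, the following Python sketch sums the series until the magnitude bound on a term falls below a tolerance. The function name `series_temperature`, the tolerance, and the term limit are assumptions made for this sketch. Along θ = π/2 the series reduces to (4c/π) arctan(r/a), which gives an independent check.

```python
import math


def series_temperature(r, theta, a=1.0, c=100.0, tol=1e-12, max_terms=100_000):
    """Partial sum of Eq. (7.32) for the semicircular plate."""
    total = 0.0
    for n in range(1, max_terms + 1):
        k = 2 * n - 1
        bound = (r / a) ** k / k          # magnitude bound on the k-th term
        total += bound * math.sin(k * theta)
        if bound < tol:                   # later terms are smaller still
            break
    return 4.0 * c / math.pi * total


if __name__ == "__main__":
    for r in (0.2, 0.4, 0.6):
        print(r, round(series_temperature(r, math.pi / 2), 3))
```

The loop stops quickly for small r/a, while for r close to a the bound decays slowly, which is the behavior the parenthetical remark above describes.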

The finite-difference method superimposes a gridwork on the plate. Suppose we take
these values for the parameters of the problem: a = 1, c = 100. With h = 0.2, the
diagram of Fig. 7.11 results; there are 17 unknowns after we utilize the left-to-right
symmetry. For points $u_2$, $u_3$, $u_{12}$, and $u_{17}$, the grid does not coincide with the boundary.
It is easy to find by analytic geometry that the short arms at $u_2$ and $u_{17}$ have a length of
0.8990h and those at $u_3$ and $u_{12}$ have a length of 0.5826h. Applying Eq. (7.31) we get
Table 7.5
Figure 7.12


errors at points where the less accurate unequal-arm operators were used. This is not
the case for this example, however.

If our mesh of points is chosen very fine (as we would do to get high accuracy), 
there is an even simpler way to handle irregular boundaries. One uses the closest 
point as the boundary point, thus in effect perturbing the actual region to one that coincides 
with the network, and then the standard operator of Eq. (7.26) applies everywhere. This 
introduces some error, of course, but often its effect is no worse than the O(h) operator 
of (7.31). It is also usually easier to program a computer solution using this perturbed- 
region technique. 

Sometimes the region, while not fit by a rectangular mesh, can be fit with nodes in 
a different arrangement. It is occasionally useful to have a finite-difference approximation 
to the Laplacian for an equispaced triangular network of points. To derive this, we need 
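the formulas for a rotation of axes from analytical geometry. The point (x, y), written in
terms of its coordinates with respect to the pair of x′,y′-axes that are rotated an angle θ
from the x,y-system, is (see Fig. 7.12):

$$x = x'\cos\theta - y'\sin\theta, \qquad y = x'\sin\theta + y'\cos\theta.$$

We first compute ∂u/∂x′:

$$\frac{\partial u}{\partial x'} = \frac{\partial u}{\partial x}\frac{\partial x}{\partial x'} + \frac{\partial u}{\partial y}\frac{\partial y}{\partial x'} = u_x\cos\theta + u_y\sin\theta.$$

Differentiating again,

$$\frac{\partial^2 u}{\partial x'^2} = \cos\theta\,(u_{xx}\cos\theta + u_{xy}\sin\theta) + \sin\theta\,(u_{yx}\cos\theta + u_{yy}\sin\theta) = u_{xx}\cos^2\theta + 2u_{xy}\sin\theta\cos\theta + u_{yy}\sin^2\theta.$$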

Figure 7.13 
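For an equispaced triangular network, the connections between nodes make angles
of 0°, 60°, and 120° with the horizontal (Fig. 7.13). Call these the a-, b-, and c-axes.

$$
\begin{aligned}
\text{For } \theta = 0^\circ:&\quad \frac{\partial^2 u}{\partial a^2} = u_{xx},\\
\text{For } \theta = 60^\circ:&\quad \frac{\partial^2 u}{\partial b^2} = \tfrac{1}{4}u_{xx} + \tfrac{\sqrt{3}}{2}u_{xy} + \tfrac{3}{4}u_{yy},\\
\text{For } \theta = 120^\circ:&\quad \frac{\partial^2 u}{\partial c^2} = \tfrac{1}{4}u_{xx} - \tfrac{\sqrt{3}}{2}u_{xy} + \tfrac{3}{4}u_{yy}.
\end{aligned}
$$

Adding, we obtain

$$\frac{\partial^2 u}{\partial a^2} + \frac{\partial^2 u}{\partial b^2} + \frac{\partial^2 u}{\partial c^2} = \tfrac{3}{2}(u_{xx} + u_{yy}) = \tfrac{3}{2}\nabla^2 u.$$

Laplace's equation, $\nabla^2 u = 0$, can be represented by

$$\nabla^2 u = \frac{2}{3}\left(\frac{\partial^2 u}{\partial a^2} + \frac{\partial^2 u}{\partial b^2} + \frac{\partial^2 u}{\partial c^2}\right) = 0.$$

Using finite-difference quotients to approximate the partial derivatives gives a pictorial
operator for a triangular network:

$$\nabla^2 u = \frac{2}{3h^2}\begin{Bmatrix} & 1 & & 1 & \\ 1 & & -6 & & 1\\ & 1 & & 1 & \end{Bmatrix} u = 0.$$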

Note that, for Laplace's equation, the potential at every point is the arithmetic average
of the potentials at its six equidistant neighbors. We observe this rule of averages even
in the three-dimensional situation below.

For circular regions, one may derive a finite-difference approximation to the Laplacian
in polar coordinates. Consider the group of points that are the nodes of a polar-coordinate
network (Fig. 7.14), with the point labels described below:

$$\nabla^2 u = \frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r} + \frac{1}{r^2}\frac{\partial^2 u}{\partial \theta^2} \approx \frac{u_F - 2u_0 + u_N}{(\Delta r)^2} + \frac{1}{r_0}\,\frac{u_F - u_N}{2\Delta r} + \frac{1}{r_0^2}\,\frac{u_L - 2u_0 + u_R}{(\Delta\theta)^2}.$$


Figure 7.14


No "standard" operator can be written for this, so finding the coefficients and solving the
set of equations by iteration when the problem is in polar coordinates is awkward, but
it helps to group the various coefficients of each term.

Calling $u_0$ the central point (at radius $r_0$), $u_N$ the point nearer the center, $u_F$ the point
farther from the center, and $u_L$ and $u_R$ the points on either side at the same radial distance
from the center, as shown in Fig. 7.14, we get, for Laplace's equation,

$$\nabla^2 u = \frac{1}{(\Delta r)^2}\left[\left(1 - \frac{\Delta r}{2r_0}\right)u_N + \left(1 + \frac{\Delta r}{2r_0}\right)u_F + \left(\frac{\Delta r}{r_0\,\Delta\theta}\right)^2(u_L + u_R) - \left(2 + 2\left(\frac{\Delta r}{r_0\,\Delta\theta}\right)^2\right)u_0\right] = 0. \qquad (7.35)$$


Let us illustrate the Laplacian in polar coordinates by again solving the example
presented at the beginning of this section. In this we computed the steady-state temperatures
on a semicircular plate of radius equal to one, with its base held at 0° and its
circumference at 100°. We superimpose a polar-coordinate system; with Δr = 0.2 and
Δθ = π/8, Fig. 7.15 results. There are 16 u-values to be determined, after we allow for
left-to-right symmetry.

When we compute the coefficients for Eq. (7.35) applied to each of the 16 points,
the matrix representation (7.36) results. Table 7.6 compares the results by finite
differences with those computed from the infinite-series solution, Eq. (7.32).
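
The polar-coordinate calculation can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the program behind Table 7.6: the names `polar_coefficients` and `solve_semicircle` are hypothetical, the 16 equations are solved by Liebmann's iteration instead of forming the matrix (7.36), the origin is treated as a point of the base (u = 0), and symmetry about θ = π/2 supplies the missing neighbor on the center line.

```python
import math


def polar_coefficients(r0, dr, dtheta):
    """Bracketed coefficients of u_N, u_F, (u_L and u_R), u_0 in Eq. (7.35)."""
    s = (dr / (r0 * dtheta)) ** 2
    return 1 - dr / (2 * r0), 1 + dr / (2 * r0), s, -(2 + 2 * s)


def solve_semicircle(c=100.0, a=1.0, nr=5, ntheta=8, tol=1e-10, max_iter=100_000):
    """Return u[i][j] at r = i*dr, theta = j*dtheta for j = 0..ntheta/2."""
    dr, dtheta = a / nr, math.pi / ntheta
    half = ntheta // 2                        # index of theta = pi/2
    u = [[0.0] * (half + 1) for _ in range(nr + 1)]
    for j in range(1, half + 1):
        u[nr][j] = c                          # circumference held at c
    for _ in range(max_iter):
        change = 0.0
        for i in range(1, nr):
            cn, cf, cs, c0 = polar_coefficients(i * dr, dr, dtheta)
            for j in range(1, half + 1):
                left = u[i][j - 1]
                # On the center line the far-side neighbor mirrors the near one.
                right = u[i][j + 1] if j < half else u[i][j - 1]
                new = (cn * u[i - 1][j] + cf * u[i + 1][j] + cs * (left + right)) / (-c0)
                change = max(change, abs(new - u[i][j]))
                u[i][j] = new
        if change < tol:
            break
    return u


if __name__ == "__main__":
    u = solve_semicircle()
    for i in range(4, 0, -1):
        print(round(0.2 * i, 1), [round(u[i][j], 3) for j in range(4, 0, -1)])
```

Printing the rows from r = 0.8 down to r = 0.2 and from θ = π/2 down to π/8 gives values in the same order as the points of Table 7.6, which makes a side-by-side comparison straightforward.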

The agreement in Table 7.6 is about as good as that observed in Table 7.5 with a
rectangular grid. However, one point, no. 4, stands out in Table 7.6 as having an unusually
large error. Large errors often occur in finite-difference solutions near a discontinuity in
the boundary conditions, and this could possibly be the explanation. On the other hand,
this phenomenon is absent in Table 7.5, possibly due to a cancellation of errors.
Figure 7.15


Table 7.6 Results for example by finite differences in polar coordinates, and analytical solution

                    Point     u, finite      u, series solution
   r       θ        number    differences    Eq. (7.32)

  0.8     π/2         1        85.661         85.906
  0.8     3π/8        2        84.425         84.792
  0.8     π/4         3        79.459         80.394
  0.8     π/8         4        63.679         66.186
  0.6     π/2         5        68.371         68.808
  0.6     3π/8        6        66.126         66.673
  0.6     π/4         7        58.058         58.862
  0.6     π/8         8        39.165         39.624
  0.4     π/2         9        48.048         48.448
  0.4     3π/8       10        45.543         45.938
  0.4     π/4        11        37.457         37.730
  0.4     π/8        12        22.374         22.252
  0.2     π/2        13        25.003         25.133
  0.2     3π/8       14        23.301         23.393
  0.2     π/4        15        18.245         18.241
  0.2     π/8        16        10.147         10.066


We observe another sparse matrix in (7.36), this time with a band structure. The
structure will depend on the ordering of the points, however.


THE LAPLACIAN OPERATOR IN THREE DIMENSIONS


Writing a difference equation to approximate the three-dimensional Laplacian is straight- 
forward. We use triple subscripts to indicate the spatial position of points, and take the 
mesh distance the same in each direction: 


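$$\nabla^2 u = \frac{u_{i+1,j,k} - 2u_{i,j,k} + u_{i-1,j,k}}{h^2} + \frac{u_{i,j+1,k} - 2u_{i,j,k} + u_{i,j-1,k}}{h^2} + \frac{u_{i,j,k+1} - 2u_{i,j,k} + u_{i,j,k-1}}{h^2}.$$


We see again, for Laplace's equation $\nabla^2 u = 0$, that the potential u is the arithmetic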
average of its six nearest neighboring values. The set of equations for this case is more 
extensive and more tedious to solve, but in principle the methods are unchanged. In hand 
calculations using iterative methods, keeping track of the successive values of u is awk- 
ward. An isometric projection of the points is generally recommended over the use of 
superimposed sheets of paper. In a computer, triple subscripting solves the problem, but 
large storage requirements are imposed by problems encountered in practice. 

In three-dimensional problems, it is easy to exceed the memory space of even large 
computer systems. For example, if the volume under consideration has 100 points on a 
side, a total of 100³ = 1,000,000 points are involved. If we were to represent the
coefficients of the one million equations in a square matrix, no computer system could 
hold all the values in real memory. Virtual memory systems could perhaps accommodate 
that many values, but the speed of computation would be excruciatingly slow. 

Of course, there are at most seven nonzero coefficients in any of these one million 
equations, but even 8,000,000 numerical values (the right-hand sides are needed too) 
will exceed the memory capacity. Obviously, the problem must be cut down in size. 
However, even only 30 points on each side of a cubical volume results in 27,000 equations, 
with about 216,000 nonzero numerical values. While such a number of coefficients can 
be stored in large real memories or in virtual memory systems, lots of computer time is 
involved in iterating through the large set of equations. Unfortunately, many more iter- 
ations are required for convergence where the number of equations is great. It is little 
wonder that three-dimensional problems, while perfectly tractable in principle, are terribly 
expensive to solve. 
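
The counts quoted above follow from simple arithmetic, which the sketch below reproduces. The function name `storage_estimate` is hypothetical; it assumes a cubical grid, at most seven nonzero coefficients per equation, and one right-hand-side value per equation, as in the text.

```python
def storage_estimate(points_per_side, nonzeros_per_row=7):
    """Return (equations, entries in the full square matrix,
    values stored if only nonzeros and right-hand sides are kept)."""
    n = points_per_side ** 3
    dense_entries = n * n
    sparse_values = n * (nonzeros_per_row + 1)   # coefficients plus right-hand side
    return n, dense_entries, sparse_values


if __name__ == "__main__":
    print(storage_estimate(100))
    print(storage_estimate(30))
```

The contrast between the second and third numbers is the reason only the nonzero coefficients are ever stored in practice.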
7.11 MATRIX PATTERNS, SPARSENESS, AND THE A.D.I. METHOD


Figure 7.16


All of the examples that have been presented in this chapter illustrate the fact that the 
coefficient matrix is sparse when an elliptic partial-differential equation is solved by the 
finite-difference method. Especially in a three-dimensional case, the number of nonzero 
coefficients is a small fraction of the total. 

The relative sparseness increases as the number of equations increases. In the two-dimensional
example of Section 7.4, the 21 × 21 coefficient matrix of Eq. (7.9) has 81%
of its positions filled with zeros. If symmetry were taken into account, there would have
been a 14 × 14 coefficient matrix with 71% zeros. Decreasing the value of h so as to
have 105 points within the region gives rise to a coefficient matrix with 96% zeros. For
a 30 × 30 × 30 three-dimensional case, there would be 90,176 nonzero coefficients, but
this is only 0.012% of the 729 × 10⁶ values in the coefficient matrix!
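
The two-dimensional percentages can be checked with a short count. The sketch below is illustrative and rests on an assumption not stated in this section: that the Section 7.4 example has a 7 × 3 array of interior points, which becomes 15 × 7 (105 points) when h is halved. The function name `five_point_zero_fraction` is hypothetical.

```python
def five_point_zero_fraction(nx, ny):
    """Fraction of zero entries in the coefficient matrix of the five-point
    operator on an nx-by-ny array of unknowns with Dirichlet boundaries."""
    n = nx * ny
    # Each pair of adjacent unknowns contributes two off-diagonal entries.
    links = (nx - 1) * ny + nx * (ny - 1)
    nonzeros = n + 2 * links
    return 1.0 - nonzeros / (n * n)


if __name__ == "__main__":
    print(round(100 * five_point_zero_fraction(7, 3)))    # 21 unknowns
    print(round(100 * five_point_zero_fraction(15, 7)))   # 105 unknowns
```

Under that assumption the two calls give the 81% and 96% figures quoted above.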

In Chapter 2 it was shown that iterative methods are usually preferred for sparse 
matrices unless they have a tridiagonal structure. The reason for this is that elimination 
does not preserve the sparseness (unless the matrix is tridiagonal), so that one cannot 
work only with nonzero terms. This is illustrated by Fig. 7.16. The original matrix in 
(a) is shown in partially triangularized form in (b). The shaded portion of the matrix, 
which originally had 50% zeros, has lost all but one of these.

Frequently the coefficient matrix has a band structure, as illustrated by Eqs. (7.33) 
and (7.36). Equation (7.36) illustrates a special regularity for the nonzero elements; the 
ordering of the points must be carefully done to attain this. A band structure is worth 
working for, however, since elimination does not introduce nonzero terms outside of the 
limits defined by the original bands. Zeros in the gaps between the parallel lines are not 
preserved, though, so the “tightest” possible bandedness is preferred. Sometimes it is 
possible to order the points so that a pentadiagonal matrix results. 


The best of the band structures is tridiagonal, with corresponding economy of storage 
and speed of solution, as discussed in Chapter 2. A method for the steady-state heat 
equation called the alternating-direction-implicit (A.D.I.) method results in tridiagonal 
matrices and is of growing popularity. It was initially developed for unsteady-state prob- 
lems, which are the subject of Chapter 8. 

In the A.D.I. method, we write the finite-difference approximation to Laplace's
equation in this fashion:

$$\nabla^2 u = \frac{u_{i-1,j}^{(n)} - 2u_{i,j}^{(n)} + u_{i+1,j}^{(n)}}{(\Delta x)^2} + \frac{u_{i,j-1}^{(n+1)} - 2u_{i,j}^{(n+1)} + u_{i,j+1}^{(n+1)}}{(\Delta y)^2} = 0. \qquad (7.37)$$

The superscripts indicate the iteration number. Note that when we are calculating
the (n + 1)st iterate, we use for the approximation to ∂²u/∂x² a set of values in a horizontal
row that are already known, the results of the nth iteration. By writing the equations in
the order of the points in each column, we obtain a tridiagonal matrix when the known
values of $u_{i-1,j}^{(n)}$, $u_{i,j}^{(n)}$, and $u_{i+1,j}^{(n)}$ are carried over to the right-hand sides.

The solution for steady-state values of $u_{i,j}$ is iterative, with the (k + 1)st and
(k + 2)nd values computed from*

$$u_{i,j}^{(k+1)} = u_{i,j}^{(k)} + \rho\left[u_{i-1,j}^{(k)} - 2u_{i,j}^{(k)} + u_{i+1,j}^{(k)}\right] + \rho\left[u_{i,j-1}^{(k+1)} - 2u_{i,j}^{(k+1)} + u_{i,j+1}^{(k+1)}\right], \qquad (7.38)$$

followed by

$$u_{i,j}^{(k+2)} = u_{i,j}^{(k+1)} + \rho\left[u_{i,j-1}^{(k+1)} - 2u_{i,j}^{(k+1)} + u_{i,j+1}^{(k+1)}\right] + \rho\left[u_{i-1,j}^{(k+2)} - 2u_{i,j}^{(k+2)} + u_{i+1,j}^{(k+2)}\right], \qquad (7.39)$$

where i indexes position in the x-direction and j position in the y-direction.

The relations are implicit but, by properly ordering the equations, we obtain tridiagonal
coefficient matrices. This requires that Eq. (7.38) proceed down each column
of points in turn, while Eq. (7.39) processes them row-wise. (This is the reason the
method is called alternating-direction implicit.) The proper choice of the parameter
ρ can accelerate the convergence; for fastest convergence of large systems, a sequence
of values of ρ is employed.†

The iterations begin with an initial estimate of the $u^{(0)}$ vector. The iterations are
terminated when successive calculations agree within a given tolerance. Since there is a
bias in the calculated results in the first, third, fifth, . . . iterates, one usually ignores
these and records the results of only the even-numbered iterations.

Except for the additional work of recalculating right-hand sides, this method has a
minimum of computational effort, equivalent to S.O.R., but converging faster. The
coefficient matrices are always the same, so the reduction step need be performed only
once, if one uses the LU equivalent or some similar procedure.


EXAMPLE A square plate conducts only in the x- and y-directions. Two adjoining edges are kept at
0° and the other edges are kept at 100°. A gridwork is drawn with four interior nodes
(Fig. 7.17). Use the A.D.I. method to estimate the temperatures at these four nodes.


*The basis for these iteration relations will become clear in the next chapter when we discuss the A.D.I. method
for parabolic partial-differential equations. The method applied to elliptic equations is really a special case of
that technique.

†Birkhoff, Varga, and Young (1962, pp. 189-273) give an extensive survey of A.D.I. methods for elliptic
equations and present many examples.
Figure 7.17

It will be convenient to use v for the temperatures when ordering the points columnwise
and u for the temperatures when ordering the points row-wise. Equations (7.38)
and (7.39) become, in matrix form,

$$
\begin{pmatrix}
(1/\rho) + 2 & -1 & 0 & 0\\
-1 & (1/\rho) + 2 & 0 & 0\\
0 & 0 & (1/\rho) + 2 & -1\\
0 & 0 & -1 & (1/\rho) + 2
\end{pmatrix}
\begin{pmatrix} v_1\\ v_2\\ v_3\\ v_4 \end{pmatrix}
=
\begin{pmatrix}
100 + 0 + [(1/\rho) - 2]u_1 + u_2\\
0 + 0 + [(1/\rho) - 2]u_3 + u_4\\
100 + u_1 + [(1/\rho) - 2]u_2 + 100\\
0 + u_3 + [(1/\rho) - 2]u_4 + 100
\end{pmatrix},
$$

$$
\begin{pmatrix}
(1/\rho) + 2 & -1 & 0 & 0\\
-1 & (1/\rho) + 2 & 0 & 0\\
0 & 0 & (1/\rho) + 2 & -1\\
0 & 0 & -1 & (1/\rho) + 2
\end{pmatrix}
\begin{pmatrix} u_1\\ u_2\\ u_3\\ u_4 \end{pmatrix}
=
\begin{pmatrix}
0 + 100 + [(1/\rho) - 2]v_1 + v_2\\
100 + 100 + [(1/\rho) - 2]v_3 + v_4\\
0 + v_1 + [(1/\rho) - 2]v_2 + 0\\
100 + v_3 + [(1/\rho) - 2]v_4 + 0
\end{pmatrix}.
$$

The results of successive iterations, beginning with u = (0, 0, 0, 0) and ρ = 0.9,
are shown in Table 7.7.
Table 7.7

Iteration
number       v1        v2        v3        v4

   1        35.85     11.52     83.21     58.89
   2        49.86     75.47     24.26     49.86
   3        50.27     25.26     74.73     49.71
   4        50.00     74.98     25.01     50.00
   5
   6
   7
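
The alternating sweeps can be sketched in Python as below. This is an illustrative sketch, not the text's program: the names `thomas` and `adi_history` are hypothetical, the tridiagonal systems are solved by the usual elimination for tridiagonal matrices, and the boundary layout (top and right edges at 100°, left and bottom at 0°) is the one implied by the right-hand sides of the matrix form of Eqs. (7.38) and (7.39) for this example. Odd sweeps are columnwise and even sweeps row-wise; `u[r][c]` is stored by grid position, so the first sweep returns $v_1$, $v_3$ in the top row and $v_2$, $v_4$ in the bottom row.

```python
def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    d, b = diag[:], rhs[:]
    for i in range(1, n):                     # forward elimination
        m = lower[i] / d[i - 1]
        d[i] -= m * upper[i - 1]
        b[i] -= m * b[i - 1]
    x = [0.0] * n
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):            # back substitution
        x[i] = (b[i] - upper[i] * x[i + 1]) / d[i]
    return x


def adi_history(n, top, bottom, left, right, rho, sweeps):
    """A.D.I. sweeps on an n-by-n interior grid; returns the grid after each sweep."""
    a, b = 1.0 / rho + 2.0, 1.0 / rho - 2.0
    off = [-1.0] * n
    u = [[0.0] * n for _ in range(n)]         # u[r][c], r = 0 next to the top edge
    history = []
    for s in range(sweeps):
        new = [[0.0] * n for _ in range(n)]
        if s % 2 == 0:                        # Eq. (7.38): implicit down each column
            for c in range(n):
                rhs = []
                for r in range(n):
                    west = u[r][c - 1] if c > 0 else left
                    east = u[r][c + 1] if c < n - 1 else right
                    val = west + b * u[r][c] + east
                    if r == 0:
                        val += top
                    if r == n - 1:
                        val += bottom
                    rhs.append(val)
                col = thomas(off, [a] * n, off, rhs)
                for r in range(n):
                    new[r][c] = col[r]
        else:                                 # Eq. (7.39): implicit along each row
            for r in range(n):
                rhs = []
                for c in range(n):
                    north = u[r - 1][c] if r > 0 else top
                    south = u[r + 1][c] if r < n - 1 else bottom
                    val = north + b * u[r][c] + south
                    if c == 0:
                        val += left
                    if c == n - 1:
                        val += right
                    rhs.append(val)
                new[r] = thomas(off, [a] * n, off, rhs)
        u = new
        history.append([row[:] for row in u])
    return history


if __name__ == "__main__":
    for k, grid in enumerate(adi_history(2, 100.0, 0.0, 0.0, 100.0, 0.9, 6), 1):
        print(k, [[round(v, 2) for v in row] for row in grid])
```

Because the coefficient matrix is the same on every sweep, a production code would factor it once and reuse the factors, as the text notes; the sketch repeats the elimination for brevity.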


For this example, the optimum value of ρ is about 1.00. Figure 7.18 shows how the
errors after six iterations vary with ρ.


Figure 7.18


With the alternating-direction-implicit method, the errors show a most interesting 
pattern as the iterations proceed. Table 7.8 gives the average error at each iteration. Note 
that the error after an odd number of iterations is only slightly reduced over that of the 
previous iteration. This is true generally for A.D.I. methods. It is as if there is a bias in 
the values from the odd iterations that is compensated for in the succeeding even iterate. 


Table 7.8 Reduction in errors with iteration number: comparison of methods in example problem

                          Average error
Iteration
number        A.D.I.      Liebmann's      S.O.R.

   1         11.185        17.970        15.055
   2          0.3725        5.272         2.930
   3          0.2725        1.318         0.2675
   4          0.0075        0.330         0.0275
   5          0.0065        0.082         0.0232
   6          0.0002        0.020         0.0022
A very important phenomenon occurs with the A.D.I. method. The equations break
up into independent subsets, here into sets of two. This can often help in the computer
solution of very large problems.

Table 7.8 also compares the errors when the example problem is solved by Liebmann's
method and by S.O.R., using ω = 1.072 (this value is the calculated $\omega_{\text{opt}}$ from Eq.
(7.15)). In this example, the A.D.I. method is clearly preferable as judged by the rate
of convergence, particularly as errors of the A.D.I. would decrease even faster if ρ had
been taken at 1.0.


For elliptic problems in three dimensions, the A.D.I. method has great advantages.
The directions alternate cyclically through the three coordinates, with new values of u
being used in only one direction for each step, keeping the coefficient matrices tridiagonal.
The references should be consulted for details.


7.12 FINITE-ELEMENT METHOD


In Section 7.9, we applied the method of finite differences, with some agony, to regions 
whose boundary does not coincide with the points of a rectangular mesh. We also showed 
how one can derive operators to solve Laplace’s and the Poisson equations for meshes 
that are not rectangular. Such manipulations suggest that the finite-difference method is 
ill-suited for other than regular rectangular regions. But the world is not always regular! 

The finite-element method, which we discuss in this section in an elementary way, 
is much better adapted to irregular regions. It uses a series of complex computations, but 
computers take care of that. There are many commercial packages, such as NASTRAN 
and CADAM, that handle a variety of elliptic differential equations.

You have already had an introduction to the finite-element method. In Chapter 6, we 
solved ordinary boundary-value problems by minimizing a quadratic functional. In that 
procedure, we did not use the differential equation itself to get the solution; the function 
that minimized the functional was the solution to the differential equation. We also saw 
that these so-called variational methods can be applied piecewise within the domain of 
the problem. This is called a finite-element method because the domain is broken into a 
number of finite elements. Such methods are of great and growing popularity in solving 
partial-differential equations. Finite elements are especially useful when the domain of 
the boundary-value problem is two- or three-dimensional, particularly for irregular 
regions. Using finite elements also facilitates local mesh refinement in those parts of the 
region where the variables of interest vary rapidly or where discontinuities occur. 

In breaking a two-dimensional area into subregions, there is a wide choice of elements 
that span the region. Rectangles can be used but don’t fit as well to irregular boundaries 
as triangles do. Other polygonal shapes are possible and even elements with curved 
boundaries, but these are considerably more complex to use. Triangles are very popular 
and we will discuss only these.* A mixture of triangles and quadrilaterals is most com- 
monly used. 


*In three dimensions, the corresponding choice for the elements is a tetrahedron, for the same reasons. 
In close analogy to the piecewise application of Rayleigh-Ritz in Chapter 6, we do
the following:


1. Find the quadratic functional that corresponds to the differential equation.
   This is well known for a large class of problems.

2. Subdivide the region into subregions (we use triangles) that span the region
   of the problem.

3. Write relations that interpolate values of the function (the solution) at the
   nodes (vertices) to give values of the function at points within the element.
   The interpolation relations are chosen to be zero outside the element so there
   is a purely local effect. We use a sum of these relations, weighted by the
   nodal values, as an approximation to the solution to the problem.

4. Substitute this weighted sum into the quadratic functional and minimize with
   respect to each unknown weighting factor by setting derivatives to zero.
   The quadratic functional breaks into a sum of integrals over each element.
   This leads to a set of linear equations that we can solve to give the solution
   to our original partial-differential equation. There is an alternative approach,
   the Galerkin method, that arrives at the same end result.


We will examine each of these four steps in turn. While we will work only in two 
dimensions, everything extends readily to three. 


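1. Find Functional The second-order linear boundary-value problem that will be our
model is

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + Qu = F \quad \text{on region } R, \qquad (7.40)$$

with boundary conditions

$$u = u_0 \ \text{on } L_1, \qquad \frac{\partial u}{\partial n} + \alpha u = \beta \ \text{on } L_2.$$

In the above, Q, F, $u_0$, α, and β are all functions of x and y. The boundary of R is divided
into portions ($L_1$) where u is known (Dirichlet condition) and portions ($L_2$) where a mixed
boundary condition holds. ∂u/∂n is the derivative of u normal to L.

Using methods similar to those of Section 6.4, the quadratic functional corresponding
to Eq. (7.40) can be developed. It is

$$I[u] = \iint_R \left[\left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 - Qu^2 + 2Fu\right] dx\,dy + \int_{L_2}\left[\alpha u^2 - 2\beta u\right] dL. \qquad (7.41)$$

The boundary integral is such that setting the first variation of I to zero reproduces
∂u/∂n + αu = β on $L_2$, with n the outward normal.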


2. Divide into Elements As we have said, we elect to use triangular elements. The
choice of where to locate the vertices of the triangles is, in part, an art. In general, one
puts many vertices in areas where the function is expected to change rapidly (corresponding
to making a smaller mesh size with finite differences). It is a good idea to make the sides
run in the direction of largest gradient. Along a curved or irregular boundary, the sides
of the elements should closely approximate the boundary. Every vertex must be a vertex
of all triangles that touch at that point. As you probably expect, the more elements, the
smaller the error in the solution.


Figure 7.19


Figure 7.20

The chore of defining the coordinates of the nodes (element vertices) is facilitated 
by computer programs, especially when one has access to a graphics terminal. At that 
device, the numerical analyst can point to locations where he or she wants a node and 
its coordinates are automatically recorded. There are routines that can divide any given 
region into triangles, but these usually don’t have the expertise of an experienced engineer. 


3. Interpolating Relations More sophisticated interpolating relations are often used,
but we will assume that, within any single triangular element, u is a linear function of
the u-values at the vertices. Note that assuming a linear relation within an element does
not imply that u is linear over the whole region R. Figure 7.19(a) illustrates the region
R subdivided into elements. Each node within R is shared by two or more triangular
elements. One of the elements is shown; its upper facet represents u within the element.

Figure 7.20(a) shows a typical element in plan view with nodes numbered 1, 2, 3 in
a local numbering system. We select the nodes (vertices) in counterclockwise order. (It
is important to be consistent in the direction of ordering.) The coordinates of the nodes
are $(x_i, y_i)$, i = 1, 2, 3.

The linear relation within the element can be written

$$u(x, y) = N_1u_1 + N_2u_2 + N_3u_3 = (N_1 \;\; N_2 \;\; N_3)\begin{pmatrix} u_1\\ u_2\\ u_3 \end{pmatrix} = N\{u\}. \qquad (7.42)$$

The $N_i$ are called shape functions or pyramid functions. (The shape function for node 3
looks like an unsymmetrical pyramid of height = 1 with its apex exactly above node 3.
Its base is the element. It is sketched in Figure 7.20(b). The shape functions for nodes
1 and 2 are similar.) It will be convenient to work with vectors and matrices as shown
in the last part of Eq. (7.42). Obviously each $N_i$ depends on the location (x, y) within
the element where the value of u is to be computed, as well as on the location of the
nodes $(x_i, y_i)$, i = 1, 2, 3. We now develop the expression for the $N_i$.

If u(x, y) varies linearly with position within the element, an alternative way to write
the linear relation is

$$u(x, y) = c_1 + c_2x + c_3y,$$

which must agree with the nodal values when $(x, y) = (x_i, y_i)$. Hence

$$
\begin{aligned}
u_1 &= c_1 + c_2x_1 + c_3y_1,\\
u_2 &= c_1 + c_2x_2 + c_3y_2,\\
u_3 &= c_1 + c_2x_3 + c_3y_3.
\end{aligned}
$$

This is a system of equations

$$M\{c\} = \{u\} \quad \text{(The curly brackets indicate a column vector.)}$$

where

$$M = \begin{pmatrix} 1 & x_1 & y_1\\ 1 & x_2 & y_2\\ 1 & x_3 & y_3 \end{pmatrix}, \qquad c = (c_1 \;\; c_2 \;\; c_3), \qquad u = (u_1 \;\; u_2 \;\; u_3). \qquad (7.43)$$

Solving for c:

$$\{c\} = M^{-1}\{u\}.$$

The inverse of M is not difficult to find:

$$M^{-1} = \frac{1}{2A}\begin{pmatrix} (x_2y_3 - x_3y_2) & (x_3y_1 - x_1y_3) & (x_1y_2 - x_2y_1)\\ (y_2 - y_3) & (y_3 - y_1) & (y_1 - y_2)\\ (x_3 - x_2) & (x_1 - x_3) & (x_2 - x_1) \end{pmatrix}, \qquad (7.44)$$

with 2A = det(M), which turns out to be the sum of the elements in row 1 of Eq.
(7.44) within the brackets. A is the area of the triangular element.* You should verify that
$M^{-1}M = I$ to ensure that Eq. (7.44) truly gives the inverse matrix.

To apply the interpolating function to the minimizing of the quadratic functional (Eq.
(7.41)), we prefer to write u in terms of the shape functions (Eq. (7.42)). This is easy
to do. We have

$$u(x, y) = c_1 + c_2x + c_3y = (1 \;\; x \;\; y)\{c\} = (1 \;\; x \;\; y)M^{-1}\{u\}.$$


*That A = ½ det(M) is shown in most books on vectors where the cross product is explained.

But, in terms of N (from Eq. (7.42)),

$$u(x, y) = N\{u\}.$$

Comparing the two expressions gives

$$N = (1 \;\; x \;\; y)M^{-1}, \qquad (7.45)$$

where $M^{-1}$ is given by Eq. (7.44). This concludes step 3. Before we go on to step 4,
we show an example to clarify step 3.


EXAMPLE Consider the triangular element with nodes 1, 2, 3 in ccw order (see Figure 7.21):

    Node     x     y      u
     1       0     0     100
     2       2     0     200
     3       0     1     300

Find c, N, and u(0.8, 0.4).

Before we use the above formulations, we can solve by inspection. Point (a) is
at (0, 0.4), so u there is 180 by linear interpolation between nodes 1 and 3. Similarly,
u at (b) is 240. The point (0.8, 0.4) is 2/3 of the distance between (a) and (b), so
u(0.8, 0.4) = 180 + (2/3)(240 − 180) = 220. We get the same result by interpolating
between points (c) and (d), and between node 1 and (e).

[Figure 7.21: the triangular element, with node 1 at (0, 0), u = 100; node 2 at (2, 0),
u = 200; and node 3 at (0, 1), u = 300.]

To get c we first compute

$$M = \begin{pmatrix} 1 & 0 & 0\\ 1 & 2 & 0\\ 1 & 0 & 1 \end{pmatrix}, \qquad M^{-1} = \begin{pmatrix} 1 & 0 & 0\\ -0.5 & 0.5 & 0\\ -1 & 0 & 1 \end{pmatrix}$$

by substituting $x_i$, $y_i$, i = 1, 2, 3 in Eqs. (7.43) and (7.44). So

$$\{c\} = M^{-1}\{u\} = \begin{pmatrix} 1 & 0 & 0\\ -0.5 & 0.5 & 0\\ -1 & 0 & 1 \end{pmatrix}\begin{pmatrix} 100\\ 200\\ 300 \end{pmatrix} = \begin{pmatrix} 100\\ 50\\ 200 \end{pmatrix},$$

giving u(x, y) = 100 + 50x + 200y. (You should confirm that this gives the correct
values for each $u_i$.)
From Eq. (7.45):

$$N = (1 \;\; x \;\; y)M^{-1} = (1 - 0.5x - y \quad 0.5x \quad y).$$

In nonmatrix form:

$$u(x, y) = N\{u\} = (1 - 0.5x - y)u_1 + (0.5x)u_2 + (y)u_3.$$

Observe that the coefficients of each $N_i$ can be read directly from the columns of $M^{-1}$.
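
Step 3 translates directly into a few small functions. The Python sketch below is illustrative only; the function names are hypothetical, and the nodes are assumed to be supplied in counterclockwise order so that the computed area is positive. It builds $M^{-1}$ from Eq. (7.44), the coefficients c from $\{c\} = M^{-1}\{u\}$, and the shape functions from Eq. (7.45).

```python
def inverse_m(nodes):
    """Return (M inverse, area A) for a triangle from Eq. (7.44);
    nodes = [(x1, y1), (x2, y2), (x3, y3)] in counterclockwise order."""
    (x1, y1), (x2, y2), (x3, y3) = nodes
    rows = [
        [x2 * y3 - x3 * y2, x3 * y1 - x1 * y3, x1 * y2 - x2 * y1],
        [y2 - y3, y3 - y1, y1 - y2],
        [x3 - x2, x1 - x3, x2 - x1],
    ]
    two_a = sum(rows[0])                      # det(M) = 2A
    return [[v / two_a for v in row] for row in rows], two_a / 2.0


def coefficients(nodes, u):
    """c1, c2, c3 of u(x, y) = c1 + c2*x + c3*y."""
    m_inv, _ = inverse_m(nodes)
    return [sum(m_inv[i][j] * u[j] for j in range(3)) for i in range(3)]


def shape_functions(nodes, x, y):
    """N = (1 x y) M^-1 evaluated at (x, y), Eq. (7.45)."""
    m_inv, _ = inverse_m(nodes)
    return [m_inv[0][j] + m_inv[1][j] * x + m_inv[2][j] * y for j in range(3)]


def interpolate(nodes, u, x, y):
    """u(x, y) = N{u} within the element."""
    return sum(n * ui for n, ui in zip(shape_functions(nodes, x, y), u))


if __name__ == "__main__":
    nodes = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
    u = [100.0, 200.0, 300.0]
    print(coefficients(nodes, u))
    print(interpolate(nodes, u, 0.8, 0.4))
```

Each shape function evaluates to 1 at its own node and 0 at the other two, which is a convenient check on any element before it is used.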

At this point we know how to write u(x, y) within the single triangular element (e)
as $u(x, y) = N^{(e)}\{u^{(e)}\}$. (The superscript (e) tells which element is being considered when
this is necessary.) We now stipulate that every $N^{(e)} = 0$ everywhere outside of element
(e). Therefore we can write, for any point (x, y) within region R,

$$u(x, y) = \sum_{\text{all elements}} N^{(e)}\{u^{(e)}\}.$$

This implies that u(x, y) is a surface composed of joined planar triangular facets.


4. Substitute into Functional and Minimize The development of step 4 is rather
involved, but it flows logically. We are to substitute, for u in Eq. (7.41),

$$u(x, y) = \sum_{\text{all elements}} N^{(e)}\{u^{(e)}\}, \qquad (7.46)$$

where the elements of each $N^{(e)}$ are obtained in step 3. The $u^{(e)}$ are not yet known except
where specified by Dirichlet boundary conditions. Our objective in step 4 is to obtain a
system of linear equations of the form

$$K\{u\} = \{b\}$$

that we will solve for the unknown u. We develop values in the rows of K and b by
setting $\partial I/\partial u_i = 0$ for each unknown $u_i$, where I is the functional (Eq. (7.41)).
When the relation in (7.46) is substituted into the functional, we get

$$I[u] = \iint_R \left[\left(\frac{\partial}{\partial x}\sum N\{u\}\right)^2 + \left(\frac{\partial}{\partial y}\sum N\{u\}\right)^2 - Q\left(\sum N\{u\}\right)^2 + 2F\sum N\{u\}\right] dx\,dy + \int_{L_2}\left[\alpha\left(\sum N\{u\}\right)^2 - 2\beta\sum N\{u\}\right] dL.$$
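Because all $N^{(e)} = 0$ outside of element (e), this breaks into a sum of integrals over each
individual element,

$$
\begin{aligned}
I[u] = \sum_{e}\Bigg\{ &\underbrace{\iint_{(e)}\left(\frac{\partial}{\partial x}N^{(e)}\{u^{(e)}\}\right)^2 dx\,dy}_{[1]} + \underbrace{\iint_{(e)}\left(\frac{\partial}{\partial y}N^{(e)}\{u^{(e)}\}\right)^2 dx\,dy}_{[2]}\\
&- \underbrace{\iint_{(e)} Q\left(N^{(e)}\{u^{(e)}\}\right)^2 dx\,dy}_{[3]} + \underbrace{\iint_{(e)} 2FN^{(e)}\{u^{(e)}\}\,dx\,dy}_{[4]}\\
&+ \underbrace{\int_{L_2^{(e)}} \alpha\left(N^{(e)}\{u^{(e)}\}\right)^2 dL}_{[5]} - \underbrace{\int_{L_2^{(e)}} 2\beta N^{(e)}\{u^{(e)}\}\,dL}_{[6]} \Bigg\}.
\end{aligned}
\qquad (7.47)
$$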
ees * 


The separate integrals are interconnected through the nodes that are shared by adjacent 
elements. What this means is that several members of the summation may contribute to 
the same elements of K and b. We will treat each part, 1 through 6, of Eq. (7.47) in 
turn. 
The development of the partial derivatives of Eq. (7.47) will be clearer if we follow 
a specific example. To keep things simpler, the example has only two elements and four 
nodes, and u is unknown at only two of these nodes. 


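EXAMPLE Solve 

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} - \frac{u}{2} = -\frac{3}{2}$$

on the region shown in Fig. 7.22. 

[Figure 7.22: a four-sided region divided into two triangular elements. Node 1 is at (0, 0), 
node 2 at (8, 0), node 3 at (10, 9), and node 4 at (0, 5). Element (1) has nodes 1, 2, 3; 
element (2) has nodes 1, 3, 4. On the left edge u varies linearly from u(0, 0) = 0 to 
u(0, 5) = 5; normal-derivative conditions are marked on the other edges.] 

Comparing with Eq. (7.40), we have Q = -1/2, F = -3/2, $\alpha$ = -0.2, $\beta$ = -0.4. Using the procedure 
of step 3 to get $N^{(1)}$ and $N^{(2)}$: 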


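$$M^{(1)} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 8 & 0 \\ 1 & 10 & 9 \end{bmatrix}, \qquad (M^{(1)})^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -0.125 & 0.125 & 0 \\ 0.028 & -0.139 & 0.111 \end{bmatrix} \begin{matrix} \\ \leftarrow \partial u/\partial x \\ \leftarrow \partial u/\partial y \end{matrix}$$

(rows of $M^{(1)}$ and columns of its inverse are labeled by the circled node numbers ①, ②, ③), 

$$M^{(2)} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 10 & 9 \\ 1 & 0 & 5 \end{bmatrix}, \qquad (M^{(2)})^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ 0.080 & 0.100 & -0.180 \\ -0.200 & 0 & 0.200 \end{bmatrix} \begin{matrix} \\ \leftarrow \partial u/\partial x \\ \leftarrow \partial u/\partial y \end{matrix}$$

(rows of $M^{(2)}$ and columns of its inverse are labeled by the circled node numbers ①, ③, ④), 

$$A^{(1)} = 36, \qquad A^{(2)} = 25.$$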
The circled numbers indicate the nodes that are related to the rows or columns. The row 


indicators in the inverses will be explained later. 
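The two element matrices and their inverses can be checked numerically. The following is a minimal sketch in Python (standard library only), assuming the node coordinates (0, 0), (8, 0), (10, 9), and (0, 5) of the example; the function names `inverse3` and `element_data` are illustrative and are not taken from the text.

```python
def inverse3(m):
    """Inverse and determinant of a 3 x 3 matrix by the adjugate (cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[v / det for v in row] for row in adj], det


def element_data(nodes):
    """Return (M inverse, area) for a triangle given as three (x, y) pairs."""
    m = [[1.0, x, y] for (x, y) in nodes]
    inv, det = inverse3(m)
    return inv, abs(det) / 2.0      # the area is half of |det M|


# Element (1): nodes 1, 2, 3.  Element (2): nodes 1, 3, 4.
inv1, area1 = element_data([(0.0, 0.0), (8.0, 0.0), (10.0, 9.0)])
inv2, area2 = element_data([(0.0, 0.0), (10.0, 9.0), (0.0, 5.0)])

for name, inv, area in (("(1)", inv1, area1), ("(2)", inv2, area2)):
    print("element", name, "area", area)
    for row in inv:
        print("   ", [round(v, 3) for v in row])
```

Row 2 of each inverse holds the coefficients of du/dx and row 3 those of du/dy, which is what the row indicators beside the inverses point out.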
We then have 


$$u(x, y) = \begin{cases} (1 - 0.125x + 0.028y)u_1 + (0.125x - 0.139y)u_2 + (0.111y)u_3 & \text{in (1)}, \\ (1 + 0.080x - 0.200y)u_1 + (0.100x)u_3 + (-0.180x + 0.200y)u_4 & \text{in (2)}. \end{cases} \tag{7.48}$$

The sum breaks into disjoint functions because $N^{(e)} = 0$ outside of (e). We could simplify 
the preceding by substituting the known values for $u_1$ and $u_4$ (known from the boundary 
conditions), but we defer. 

Now consider each part of Eq. (7.47) in turn. 


For part [1] of Eq. (7.47): 
Using Eq. (7.48), 

$$\frac{\partial u^{(1)}}{\partial x} = -0.125u_1 + 0.125u_2 + (0)u_3,$$

$$\frac{\partial u^{(2)}}{\partial x} = 0.080u_1 + 0.100u_3 - 0.180u_4.$$

Observe that these come directly from the second row of the inverse matrices as indicated 
by the arrows. 

Squaring these and taking partial derivatives $\partial/\partial u_2$ and $\partial/\partial u_3$ (we don't take 
derivatives with respect to $u_1$ and $u_4$ because these are fixed by the boundary conditions; they 
don't vary), 

$$\frac{\partial}{\partial u_2}: \quad 2(-0.125u_1 + 0.125u_2)(0.125) \iint^{(1)} dA = 2(0.125)(-0.125u_1 + 0.125u_2)A^{(1)};$$

$$\frac{\partial}{\partial u_3}: \quad 2(-0.125u_1 + 0.125u_2)(0) \iint^{(1)} dA + 2(0.080u_1 + 0.100u_3 - 0.180u_4)(0.100) \iint^{(2)} dA$$

$$= 0 + 2(0.100)(0.080u_1 + 0.100u_3 - 0.180u_4)A^{(2)}.$$

Again, observe that these could be written directly from the inverse matrices. Note that 
there is no term from (2) in $\partial/\partial u_2$ because node 2 is not in element (2). The zero occurs 
in $\partial/\partial u_3$ because the coefficient of $u_3$ is zero in the x-derivative for element (1). 

In the preceding, $u_1 = 0$, $u_4 = 5$, $A^{(1)} = 36$, $A^{(2)} = 25$, so we have these contributions 
to rows 1 and 2 of K and b: 

To row 1: $1.125u_2$, 

To row 2: $0.500u_3 - 4.500$. 


For part [2] of Eq. (7.47): 

This is very similar to part [1], and we know now how to write the result directly from 
the inverse matrices: 

$$\frac{\partial}{\partial u_2}: \quad 2(0.028u_1 - 0.139u_2 + 0.111u_3)(-0.139)A^{(1)};$$

$$\frac{\partial}{\partial u_3}: \quad 2(0.028u_1 - 0.139u_2 + 0.111u_3)(0.111)A^{(1)} + 0.$$

Again, terms vanish for the same reasons as in [1]. Substituting $u_1 = 0$, $A^{(1)} = 36$, we 
get these contributions to elements of K: 

To row 1: $1.391u_2 - 1.111u_3$, 

To row 2: $-1.111u_2 + 0.887u_3$. 


For part [3] of Eq. (7.47): 

Consider a general term for element (e) whose nodes are r, s, t: 

$$\iint^{(e)} -Qu^2 \, dx\,dy = \iint^{(e)} -Q(N_r u_r + N_s u_s + N_t u_t)^2 \, dx\,dy$$

$$= \iint^{(e)} -Q(N_r^2 u_r^2 + N_s^2 u_s^2 + N_t^2 u_t^2 + 2N_r N_s u_r u_s + 2N_s N_t u_s u_t + 2N_r N_t u_r u_t) \, dx\,dy.$$

If we take the partial derivative $\partial/\partial u_r$: 

$$\frac{\partial}{\partial u_r} = \iint^{(e)} -Q(2N_r^2 u_r + 2N_r N_s u_s + 2N_r N_t u_t) \, dx\,dy,$$

we see that we must evaluate integrals of the form 

$$2u_j \iint^{(e)} -Q N_r N_j \, dA, \qquad j = r, s, t.$$

If Q is constant over the element (e), this is not hard to evaluate, because 

$$-2Q \iint^{(e)} N_j N_k \, dx\,dy = \begin{cases} -QA/3 & \text{if } j = k, \\ -QA/6 & \text{if } j \ne k.^* \end{cases}$$


Usually, for a Q varying with x and y, one takes an average value for Q(x, y) 
throughout the element, often the average of the values of Q at the vertices. 
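The element integrals used in part [3] can be tabulated from the factorial formula quoted in the footnote. The sketch below is a minimal Python illustration for constant Q; the function names `shape_integral` and `mass_coefficient` are hypothetical and chosen only for this example.

```python
from math import factorial


def shape_integral(a, b, c, area):
    """Integral of N_r**a * N_s**b * N_t**c over a triangle of the given area,
    using 2 * a! * b! * c! / (a + b + c + 2)! * area."""
    return (2.0 * area * factorial(a) * factorial(b) * factorial(c)
            / factorial(a + b + c + 2))


def mass_coefficient(q, area, same_node):
    """Coefficient of u_k in d/du_j of part [3]: -2Q times the integral of N_j N_k."""
    if same_node:
        integral = shape_integral(2, 0, 0, area)    # area / 6
    else:
        integral = shape_integral(1, 1, 0, area)    # area / 12
    return -2.0 * q * integral


# Values for the worked example: Q = -1/2, areas 36 and 25.
Q = -0.5
row1_u2 = mass_coefficient(Q, 36.0, True)
row1_u3 = mass_coefficient(Q, 36.0, False)
row2_u3 = mass_coefficient(Q, 36.0, True) + mass_coefficient(Q, 25.0, True)
print(row1_u2, row1_u3, round(row2_u3, 3))
```

Because node 3 belongs to both elements, its diagonal coefficient is the sum of two element terms, which is how shared nodes couple the separate integrals.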

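Returning now to our specific example where Q = -1/2, we have 

$$\frac{\partial}{\partial u_2}: \quad \frac{1}{2} \iint^{(1)} 2(N_2 N_1 u_1 + N_2^2 u_2 + N_2 N_3 u_3) \, dx\,dy = u_1 \frac{A^{(1)}}{12} + u_2 \frac{A^{(1)}}{6} + u_3 \frac{A^{(1)}}{12};$$

$$\frac{\partial}{\partial u_3}: \quad \frac{1}{2} \iint^{(1)} 2(N_3 N_1 u_1 + N_3 N_2 u_2 + N_3^2 u_3) \, dx\,dy + \frac{1}{2} \iint^{(2)} 2(N_3 N_1 u_1 + N_3^2 u_3 + N_3 N_4 u_4) \, dx\,dy$$

$$= u_1 \frac{A^{(1)}}{12} + u_2 \frac{A^{(1)}}{12} + u_3 \frac{A^{(1)}}{6} + u_1 \frac{A^{(2)}}{12} + u_3 \frac{A^{(2)}}{6} + u_4 \frac{A^{(2)}}{12}.$$

Substituting the known values of $u_1 = 0$, $u_4 = 5$, $A^{(1)} = 36$, $A^{(2)} = 25$, we get these 
contributions to K and b: 

To row 1: $6u_2 + 3u_3$, 

To row 2: $3u_2 + 10.166u_3 + 10.417$. 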


For part [4] of Eq. (7.47): 

Consider again the general case of element (e) with nodes r, s, t: 

$$\iint^{(e)} 2Fu \, dx\,dy = \iint^{(e)} 2F(N_r u_r + N_s u_s + N_t u_t) \, dx\,dy.$$

*This comes from the integral of products of powers of shape functions over a triangular element: 

$$\iint N_r^a N_s^b N_t^c \, dA = \frac{2\, a!\, b!\, c!}{(a + b + c + 2)!} A, \qquad A = \text{area of element}.$$

Deriving this is not easy! 
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It is not hard to show that 


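$$\int N_j N_k \, dL = \begin{cases} L/3 & \text{if } j = k, \\ L/6 & \text{if } j \ne k, \end{cases}$$

where L = the length of the side from node s to node t. (Actually, on this side $N_r = 0$, 
$N_s$ varies 1 to 0, and $N_t$ varies 0 to 1.) 
For our specific example, only element (1) has a side (from node 1 to node 2) on 
$L_2$, and $\alpha = -0.2$. Hence these contributions to K result: 

$$\frac{\partial}{\partial u_2}: \quad -0.2 \int_{L_2} 2(N_2 N_1 u_1 + N_2^2 u_2 + N_2 N_3 u_3) \, dx = u_1 \frac{-0.4(x_2 - x_1)}{6} + u_2 \frac{-0.4(x_2 - x_1)}{3} + u_3(0);$$

$$\frac{\partial}{\partial u_3}: \quad -0.2 \int_{L_2} 2(N_3 N_1 u_1 + N_3 N_2 u_2 + N_3^2 u_3) \, dx = 0 \quad (\text{because } N_3 = 0 \text{ on } L_2).$$

Substituting $u_1 = 0$, $x_2 - x_1 = 8$, we get the contributions 

To row 1: $-1.066u_2$, 

To row 2: none. 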


For part [6] of Eq. (7.47): 

The integral (of $2\beta u \, dL$ over $L_2$) also applies only to elements that have a side on $L_2$. This 
part parallels part [4]: We attack it in the same way. Consider the general element (e) with 
nodes r, s, t. Suppose that side s-t is on $L_2$. We will again take $\beta$ = constant. 

As in part [4], 

$$\frac{\partial}{\partial u_s} = 2\beta \int_{\text{node } s}^{\text{node } t} N_s \, dL = 2\beta \frac{L}{2},$$

where L is the length of the side s-t, because $N_s$ is a triangle with side s-t as its base and 
a height of 1. 
In our specific example, $\beta = -0.4$, L = 8, so we get contributions to b: 

To row 1: $2(-0.4)(8/2) = -3.2$, 

To row 2: none, because $N_3 = 0$ on the side 1-2. 


Table 7.9 assembles the separate contributions. 

[Table 7.9: assembly of the contributions of parts [1] through [6] to rows 1 and 2 of K and b.] 

Now we have the system from which we solve for u: 

$$\begin{bmatrix} 7.45 & 1.889 \\ 1.889 & 11.553 \end{bmatrix} \begin{bmatrix} u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} 39.2 \\ 55.083 \end{bmatrix},$$

which gives $u_2 = 4.230$ and $u_3 = 4.075$. (Note carefully that the signs for b change 
from those given in the assembly display. In that display they were in terms of Ku + 
b = 0, so we had to change signs to put them on the right-hand side.) 

All this seems very tedious when done by hand, but everything is quite straightforward 
and not too difficult to put into a computer program. The most difficult part is to guide 
the user in inputting the coordinates of the nodes, the boundary conditions, and the 
coefficients of the equations. Care is needed in the design of the data structure so that 
elements, nodes, and sides can be interrelated. * 
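As an illustration of how the assembly might be programmed, the sketch below builds K and b for the two-element example and solves the resulting 2 x 2 system. It is a minimal Python sketch, not the book's program: the data layout (`NODES`, `ELEMENTS`, `KNOWN`, `L2_SIDES`) and all function names are assumptions made for this example, Q, F, alpha, and beta are taken as constants, and the sign convention is that of Eq. (7.47), with the boundary integrals added.

```python
from math import hypot

# Data of the two-element example (node coordinates as in the element matrices).
NODES = {1: (0.0, 0.0), 2: (8.0, 0.0), 3: (10.0, 9.0), 4: (0.0, 5.0)}
ELEMENTS = [(1, 2, 3), (1, 3, 4)]
KNOWN = {1: 0.0, 4: 5.0}              # Dirichlet values
UNKNOWN = [2, 3]
Q, F, ALPHA, BETA = -0.5, -1.5, -0.2, -0.4
L2_SIDES = [(1, 2)]                   # element sides lying on L2


def gradients(tri):
    """Area of a triangle and the x- and y-derivatives of its shape functions."""
    (x1, y1), (x2, y2), (x3, y3) = (NODES[n] for n in tri)
    det = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)
    bx = dict(zip(tri, ((y2 - y3) / det, (y3 - y1) / det, (y1 - y2) / det)))
    by = dict(zip(tri, ((x3 - x2) / det, (x1 - x3) / det, (x2 - x1) / det)))
    return abs(det) / 2.0, bx, by


def assemble():
    """Build K and b (K u = b) from dI/du_j = 0 for each unknown node j."""
    idx = {n: i for i, n in enumerate(UNKNOWN)}
    k = [[0.0] * len(UNKNOWN) for _ in UNKNOWN]
    const = [0.0] * len(UNKNOWN)      # terms that do not multiply an unknown

    def add(j, n, coef):
        if n in KNOWN:
            const[idx[j]] += coef * KNOWN[n]
        else:
            k[idx[j]][idx[n]] += coef

    for tri in ELEMENTS:
        area, bx, by = gradients(tri)
        for j in tri:
            if j in KNOWN:
                continue
            const[idx[j]] += 2.0 * F * area / 3.0                    # part 4
            for n in tri:
                grad = 2.0 * area * (bx[j] * bx[n] + by[j] * by[n])  # parts 1, 2
                mass = area / 6.0 if n == j else area / 12.0
                add(j, n, grad - 2.0 * Q * mass)                     # part 3
    for side in L2_SIDES:
        (xa, ya), (xb, yb) = NODES[side[0]], NODES[side[1]]
        length = hypot(xb - xa, yb - ya)
        for j in side:
            if j in KNOWN:
                continue
            const[idx[j]] += BETA * length                           # part 6
            for n in side:
                line = length / 3.0 if n == j else length / 6.0
                add(j, n, 2.0 * ALPHA * line)                        # part 5
    return k, [-c for c in const]     # move constants to the right-hand side


def solve2(k, b):
    """Cramer's rule for a 2 x 2 system."""
    det = k[0][0] * k[1][1] - k[0][1] * k[1][0]
    return [(b[0] * k[1][1] - k[0][1] * b[1]) / det,
            (k[0][0] * b[1] - k[1][0] * b[0]) / det]


K, B = assemble()
U2, U3 = solve2(K, B)
print([[round(v, 3) for v in row] for row in K], [round(v, 3) for v in B])
print(round(U2, 3), round(U3, 3))
```

The dictionaries keyed by node number are one simple way to interrelate elements, nodes, and sides; a production program would more likely use arrays and a sparse or banded matrix. The small differences from the hand-computed entries of K come from the three-digit rounding of the inverse matrices in the text.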


CHAPTER SUMMARY 


If you fully understand Chapter 7, you should be able to do the following. 

1. Explain how to analyze heat flow to derive the partial-differential equation that 

2. Tell in which class a given differential equation falls. 

3. Set up the system of equations that results from replacing derivatives with finite- 
difference approximations and solve by either elimination or iteration. You should 
be able to handle a variety of boundary conditions. 

4. List the advantages of iteration over elimination for solving large systems and explain 
why overrelaxation speeds up convergence. 

5. Show how the finite-difference method can be adapted to nonrectangular regions. 

6. Discuss the problems encountered in three-dimensional situations. 
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7. Use the A.D.I. procedure to solve problems and tell why it is advantageous. 


8. Solve a small problem by the finite-element method and compare it to the finite- 
difference method. 


9. Utilize the computer programs of this chapter to get the numerical solution to typical 
examples. 


SELECTED READINGS FOR CHAPTER 7 
Allaire (1985); Burnett (1987); Davis (1980); Smith (1978); Vichnevetsky (1981). 


COMPUTER PROGRAMS 


Since Laplace's equation is a special case of Poisson's equation, we present two computer 
programs for the latter, in two dimensions. They can be applied to Laplace's equation 
by defining f(x, y) = 0. 

The first (Fig. 7.23) is a program that generates the coefficient matrix for the Laplacian 
on a rectangular region. The parameters passed to the subroutine LAPMTX include the 
matrix A into which the coefficients are placed; NDIM, the first dimension of the matrix 
A in the calling program; and N and M, the number of interior grid points in the x- and 
y-directions. Only nonzero values are inserted in the subroutine, so the main program must 
set all the values of A to zero. Finally, the parameter METHOD determines whether the 
five-point formula (Eq. 7.6b) or the nine-point formula (Eq. 7.6c) is used; it is set to 
either 5 or 9 in the calling arguments. This particular program implements the example 
in Section 7.4 and produces Eqs. (7.9a) and (7.9b) for METHOD = 5 and 9, respectively. 

SUBROUTINE LAPMTX is used in connection with a linear-equation solver such 
as subroutines LUD and SOLVE of Chapter 2; the main program must establish the right- 
hand-side values based on the boundary conditions and the f(x, y) values, combining 
them with the matrix produced by LAPMTX and sending them to the linear-equation 
solver to obtain the solution vector, which will then be the calculated steady-state potential 
values at each node of the grid. The difficulty with this technique is the large memory 
requirements that may be needed to store the values of matrix A. 
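The band structure that LAPMTX produces for the five-point formula, and the storage it costs, can be seen in a short sketch. The Python below is an illustration only (the function name `five_point_matrix` is hypothetical); it assumes the same row-by-row ordering of the unknowns as the subroutine, with n points per grid row and m grid rows.

```python
def five_point_matrix(n, m):
    """Dense five-point Laplacian matrix for n points per grid row and m rows.

    -4 on the diagonal, 1 for each left/right and up/down neighbor; the
    left/right coupling is omitted where a grid row ends.
    """
    size = n * m
    a = [[0.0] * size for _ in range(size)]
    for i in range(size):
        a[i][i] = -4.0
        if i % n != n - 1:              # not the last point of its grid row
            a[i][i + 1] = 1.0
            a[i + 1][i] = 1.0
        if i + n < size:                # neighbor in the next grid row
            a[i][i + n] = 1.0
            a[i + n][i] = 1.0
    return a


a = five_point_matrix(7, 3)             # the 7 x 3 case used by Program 1
nonzero = sum(1 for row in a for v in row if v != 0.0)
print(len(a), "equations;", nonzero, "nonzero of", len(a) ** 2, "entries")
```

Most entries of the dense array are zero, and the proportion of zeros grows as the grid is refined, which is the memory difficulty the text mentions and one motivation for iterative methods.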

Program 2 (Fig. 7.24) illustrates how successive overrelaxation can be used on 
problems involving Poisson's equation. The program as shown is applied to Laplace's 
equation by defining the f(x, y) term in Poisson’s equation as zero. 

The constants of the problem are set by DATA statements and DO loops; these would 
be modified for a different problem. As shown, they apply to the example problem of 
Section 7.5 in which the steady-state temperatures within a rectangular region are deter- 
mined with 105 interior points. The value of the overrelaxation factor was varied to 
generate the data of Table 7.3. The program can be readily adapted to other problems 
with a rectangular region and Dirichlet conditions. 

Program 2 begins the iterations by setting all interior points at the average potential 
of the boundary points. If better starting values are known, they might be read in rather 
than computing an average potential. If f(x, y) is variable over the region, appropriate 

      PROGRAM LAPLACE (INPUT,OUTPUT,TAPE6=OUTPUT)
C
C  THIS PROGRAM GENERATES THE COEFFICIENT MATRIX FOR THE LAPLACIAN ON A
C  RECTANGULAR REGION THAT HAS N POINTS IN THE X-DIRECTION AND M POINTS
C  IN THE Y-DIRECTION.
C
C  THIS PROGRAM CAN GENERATE A MATRIX BASED ON EITHER THE 5-POINT OR THE
C  9-POINT METHOD PRESENTED IN SECTION 7.4
C
C  THIS PROGRAM SOLVES THE EXAMPLE PROBLEM IN SECTION 7.4 USING BOTH
C  THE 5-POINT AND 9-POINT METHODS OF FINITE DIFFERENCES WITH H TAKEN
C  AS H = 0.125
C
      REAL A(40,40), RHS(40), H
      INTEGER N,M,MAXSIZ,MSIZE,METHOD
C
      DATA A,RHS/1640*0.0/
      DATA MAXSIZ/40/
      DATA N,M/7,3/
      DATA NDIM/40/
C
    1 PRINT *, ' ENTER A ''5'' FOR THE 5-POINT OR '
      PRINT *, ' A ''9'' FOR THE 9-POINT. '
      READ *, METHOD
      IF ( (METHOD .NE. 5) .AND. (METHOD .NE. 9) ) THEN
         PRINT *, ' ENTER ONLY A ''5'' OR A ''9'' '
         GO TO 1
      END IF
C
      CALL LAPMTX (A,NDIM,N,M,METHOD)
C
C  PRINT OUT MATRIX
C
      MSIZE = M * N
      DO 10 I = 1,MSIZE
         WRITE (6,'(40I5)') (INT(A(I,J)), J = 1,MSIZE)
   10 CONTINUE
      STOP
      END
C
      SUBROUTINE LAPMTX(A,NDIM,N,M,METHOD)
C
C  SUBROUTINE LAPMTX :
C
C  THIS SUBROUTINE FORMS THE COEFFICIENT MATRIX
C  FOR THE LAPLACIAN ON A RECTANGULAR REGION THAT HAS N POINTS IN THE
C  X DIRECTION AND M POINTS IN THE Y DIRECTION. IT ASSUMES THAT THE
C  MATRIX ALREADY HAS ZEROS EVERYWHERE.
C

Figure 7.23 Program 1. 
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Figure 7.23 (continued) 


C  PARAMETERS ARE :
C
C  A      - MATRIX THAT RETURNS COEFFICIENTS,
C           MUST CONTAIN ZEROS TO START
C  NDIM   - FIRST DIMENSION OF A IN THE MAIN PROGRAM
C  N      - NUMBER OF GRID POINTS IN THE X DIRECTION
C  M      - NUMBER OF GRID POINTS IN THE Y DIRECTION
C  METHOD - 5 IMPLIES FIVE POINT FORMULA
C           9 IMPLIES NINE POINT FORMULA
C
      REAL A(NDIM,NDIM)
      INTEGER NDIM,N,M,NL1,NP1,MSIZE,MSLN,METHOD,I
C
C  FIRST WE COMPUTE SOME CONSTANTS
C
      NL1 = N - 1
      NP1 = N + 1
      MSIZE = N*M
      MSLN = MSIZE - N
C
C  FIVE POINT METHOD IS DONE HERE
C
      IF (METHOD .EQ. 5 ) THEN
C
C  PUT -4 ON THE DIAGONAL
C
         DO 10 I = 1,MSIZE
            A(I,I) = -4.0
   10    CONTINUE
C
C  NOW PUT 1 ABOVE AND BELOW THE DIAGONAL
C
         DO 15 I = 2,MSIZE
            A(I-1,I) = 1.0
            A(I,I-1) = 1.0
   15    CONTINUE
C
C  NOW REPLACE SOME ONES WITH ZEROS
C
         DO 20 I = N,MSLN,N
            A(I,I+1) = 0.0
            A(I+1,I) = 0.0
   20    CONTINUE
C
C  PUT IN THE SIDE BANDS AND GO BACK
C
         DO 30 I = NP1,MSIZE
            A(I,I-N) = 1.0
            A(I-N,I) = 1.0
   30    CONTINUE
      ELSE
C
C  NINE POINT METHOD: PUT -20 ON THE DIAGONAL
C
         DO 40 I = 1,MSIZE
            A(I,I) = -20.0
   40    CONTINUE
C
C  PUT 4 ABOVE AND BELOW THE MAIN DIAGONAL
C
         DO 50 I = 2,MSIZE
            A(I-1,I) = 4.0
            A(I,I-1) = 4.0
   50    CONTINUE
C
C  REPLACE SOME 4'S WITH 0'S
C
         DO 60 I = N,MSLN,N
            A(I,I+1) = 0.0
            A(I+1,I) = 0.0
   60    CONTINUE
C
C  PUT IN THE SIDE BANDS (4'S) AND THE DIAGONAL NEIGHBORS (1'S)
C
         DO 70 I = NP1,MSIZE
            A(I,I-N) = 4.0
            A(I-N,I) = 4.0
            A(I,I-N+1) = 1.0
            A(I-N+1,I) = 1.0
            A(I+1,I-N) = 1.0
            A(I-N,I+1) = 1.0
   70    CONTINUE
C
C  REMOVE THE DIAGONAL-NEIGHBOR ONES THAT WRAP AROUND A GRID ROW
C
         DO 80 I = NP1,MSIZE,N
            A(I-1,I+N) = 0.0
            A(I,I+N-1) = 0.0
            A(I+N-1,I) = 0.0
            A(I+N,I-1) = 0.0
   80    CONTINUE
C
      ENDIF
      RETURN
      END
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(continued) 


OUTPUT FOR PROGRAM 1 

FIVE POINT METHOD 

[Listing of the 21 x 21 coefficient matrix: -4 on the main diagonal and 1 on the adjacent and side bands.] 

NINE POINT METHOD 

[Listing of the 21 x 21 coefficient matrix: -20 on the main diagonal, 4 on the adjacent and side bands, and 1 at the diagonal-neighbor positions.] 


      PROGRAM POISSON (INPUT,OUTPUT)
C
C  A PROGRAM TO SOLVE POISSON'S EQUATION ON A RECTANGULAR REGION.
C  THE ACCELERATED LIEBMANN'S ( OVER-RELAXATION ) METHOD IS USED.
C
C  PARAMETERS ARE :
C
C  NWIDE   NUMBER OF NODES IN THE X DIRECTION
C  NHIGH   NUMBER OF NODES IN THE Y DIRECTION
C  F(X,Y)  R.H.S. FUNCTION FOR POISSON EQUATION
C  TOL     TOLERANCE TO STOP ITERATIONS
C  W       OVER-RELAXATION FACTOR
C  H       MESH SIZE
C
      REAL U(100,100),SUM,UAVG,RESID,CHGMAX,TOL,W,H
      INTEGER NWIDE,NHIGH,NHP1,NWP1,I,J
C
C  DEFINE THE FUNCTION WITH AN ARITHMETIC STATEMENT FUNCTION AND
C  INITIALIZE DATA AND VARIABLES.
C
      F(X,Y) = 0.0
      DATA NWIDE,NHIGH,TOL,W,H/16,8,.001,1.4,1.25/
      NHP1 = NHIGH + 1
      NWP1 = NWIDE + 1
C
C  USE DO LOOPS TO ESTABLISH BOUNDARY VALUES
C
      DO 1 I = 1,NHP1
         U(I,1) = 0.0
         U(I,NWP1) = 100.0
    1 CONTINUE
      DO 2 I = 2,NWIDE
         U(1,I) = 0.0
         U(NHP1,I) = 0.0
    2 CONTINUE
C
C  COMPUTE MEAN VALUE OF THE BOUNDARY VALUES TO USE FOR INITIAL VALUE
C  OF INTERIOR POINTS
C
    5 SUM = 0.0
      DO 10 I = 1,NHP1
         SUM = SUM + U(I,1) + U(I,NWP1)
   10 CONTINUE
      DO 20 I = 2,NWIDE
         SUM = SUM + U(1,I) + U(NHP1,I)
   20 CONTINUE
      UAVG = SUM / FLOAT(2*NWP1 + 2*(NHIGH - 1))
      X = 0.0
      Y = 0.0
      DO 30 I = 2,NHIGH
      DO 30 J = 2,NWIDE
         U(I,J) = UAVG + H*H*F(X,Y)
   30 CONTINUE
C
C  PRINT A HEADING AND BEGIN THE ITERATIONS. LIMIT TO 100 ITERATIONS OR
C  UNTIL THE MAX CHANGE IN U VALUES IS .LE. TOL.
C
      PRINT 199, W
      DO 50 KNT = 1,100
         CHGMAX = 0.0
         DO 40 I = 2,NHIGH
            Y = (I-1) * H
            DO 35 J = 2,NWIDE
               X = (J-1) * H
               RESID = W/4.0*( U(I+1,J) + U(I-1,J) + U(I,J+1)
     +                 + U(I,J-1) - 4.0*U(I,J) + H*H*F(X,Y) )
               IF ( CHGMAX .LT. ABS(RESID) ) CHGMAX = ABS(RESID)
               U(I,J) = U(I,J) + RESID
   35       CONTINUE
   40    CONTINUE
         IF ( CHGMAX .LT. TOL ) GO TO 55
   50 CONTINUE
   55 PRINT 200, KNT,CHGMAX
      DO 45 I = 1,NHP1
         PRINT 201, ( U(I,J), J = 1,NWP1 )
   45 CONTINUE
      W = W + 0.1
      IF (W .LT. 1.8) GO TO 5
C
  199 FORMAT(///' ITERATIONS WITH OVER-RELAXATION FACTOR OF ',F5.2)
  200 FORMAT(/' AFTER ITERATION NO. ',I3,' MAX CHANGE IN U = ',
     +       F8.4,' U MATRIX IS '/)
  201 FORMAT (1X, 9F8.2)
      STOP
      END

Figure 7.24 Program 2. 


OUTPUT FOR PROGRAM 2 

ITERATIONS WITH OVER-RELAXATION FACTOR OF 1.40 

AFTER ITERATION NO. 29 MAX CHANGE IN U = .0008 U MATRIX IS 

[Listing of the 9 x 17 array of U values; the right-hand boundary column is 100.] 

ITERATIONS WITH OVER-RELAXATION FACTOR OF 

AFTER ITERATION NO. 26 MAX CHANGE IN U = U MATRIX IS 

[Listing of the 9 x 17 array of U values; the right-hand boundary column is 100.] 

      PROGRAM ADIELL (INPUT,OUTPUT)
C
C  THIS PROGRAM SOLVES A STEADY STATE HEAT FLOW PROBLEM BY THE
C  ALTERNATING DIRECTION IMPLICIT METHOD.
C
C  PARAMETERS ARE :
C
C  UCOEF  - MATRIX OF COEFFICIENTS FOR ODD TRAVERSES
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values of the x- and y-coordinates of each point will need to be defined within the DO 
loops that end on statements 35 and 40, so that the arithmetic-statement function will be 
correctly evaluated. 

If one does not wish to investigate the effect of w on the number of iterations required, 
the program would be run with only one value for w, presumably that calculated from 
Eq. (7.15). 
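The relaxation sweep at the heart of Program 2 can be sketched compactly. The Python below is an illustrative restatement, not a translation of the listing: it assumes Laplace's equation (f = 0), Dirichlet values already stored on the edge of the grid, and a single value of the factor w; the name `sor_laplace` is hypothetical.

```python
def sor_laplace(u, w=1.4, tol=1e-3, max_iter=100):
    """Successive overrelaxation for Laplace's equation on a rectangular grid.

    u is a list of rows whose outer edge holds the boundary values; interior
    values are updated in place. Returns the number of sweeps performed.
    """
    rows, cols = len(u), len(u[0])
    for count in range(1, max_iter + 1):
        change_max = 0.0
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                resid = w / 4.0 * (u[i + 1][j] + u[i - 1][j] + u[i][j + 1]
                                   + u[i][j - 1] - 4.0 * u[i][j])
                change_max = max(change_max, abs(resid))
                u[i][j] += resid
        if change_max < tol:
            return count
    return max_iter


# Small check: boundary values taken from the plane u = 3i + 2j, which the
# five-point formula reproduces exactly, so the interior should converge to it.
grid = [[float(3 * i + 2 * j) if i in (0, 4) or j in (0, 5) else 0.0
         for j in range(6)] for i in range(5)]
sweeps = sor_laplace(grid, w=1.4, tol=1e-8, max_iter=500)
print("sweeps used:", sweeps)
```

Running the function for several values of w and comparing the returned sweep counts gives the kind of comparison that Program 2 makes by looping over w.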

The A.D.I. method is the basis for Program 3 (see Fig. 7.25). It solves for the 
steady-state potential from Laplace's equation based on a rectangle. The size of the region 
(as specified by M and N, the number of intervals in the x- and y-directions), the accel- 
eration factor p, and constant boundary conditions are defined by DATA statements; 
DATA statements are also used to insert ones into the off-diagonal elements of the u- and 
v-coefficient matrices. Program 3 solves a problem with 105 interior points. 

The u- and v-coefficient matrices are established, as are the boundary-value terms 
for computing the right-hand sides for equations analogous to those of Section 7.11. The 
LU decompositions of the coefficient matrices are then found, in preparation for solving 
the tridiagonal systems. 

After this preliminary work, the right-hand-side vector for the first traverse is com- 
puted using initial estimates of the potential; then the equations are solved for the first 
iterate of the values of the potential at the interior points in the rectangular region. 

These first calculated values are then used to compute the right-hand-side vector for 
the vertical traverse; the alternate set of equations is solved for the second iteration. 

Iterations are continued, alternating the direction of traversing the points, until the 
final solution is obtained. A better way to terminate the iterations would be to compare 
the change in components of the solution vector against a tolerance criterion, rather than 
to perform a fixed number of iterations as done here. 
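The tridiagonal solves and the suggested tolerance test can be sketched as follows. This is a minimal Python illustration that mirrors the three-column storage of Program 3 (subdiagonal, diagonal, superdiagonal in each row); the function names and the small test system are assumptions made for the example.

```python
def factor_tridiagonal(coef):
    """In-place LU factorization; coef[i] = [sub, diag, super] for row i."""
    for i in range(1, len(coef)):
        coef[i - 1][2] /= coef[i - 1][1]
        coef[i][1] -= coef[i][0] * coef[i - 1][2]


def solve_tridiagonal(coef, rhs):
    """Solve using the factors from factor_tridiagonal; returns a new list."""
    x = list(rhs)
    x[0] /= coef[0][1]
    for i in range(1, len(coef)):                 # forward: y = L(-1) * b
        x[i] = (x[i] - coef[i][0] * x[i - 1]) / coef[i][1]
    for i in range(len(coef) - 2, -1, -1):        # backward: x = U(-1) * y
        x[i] -= coef[i][2] * x[i + 1]
    return x


def max_change(old, new):
    """Largest change in any component between two iterates."""
    return max(abs(p - q) for p, q in zip(old, new))


# Diagonal 1/rho + 2 with rho = 1, off-diagonals -1, as in the coefficient matrices.
coef = [[0.0, 3.0, -1.0], [-1.0, 3.0, -1.0], [-1.0, 3.0, -1.0], [-1.0, 3.0, 0.0]]
factor_tridiagonal(coef)
x = solve_tridiagonal(coef, [1.0, 2.0, 3.0, 9.0])
print([round(v, 6) for v in x])
```

In an A.D.I. loop one would keep the previous iterate and stop when `max_change(previous, current)` falls below a chosen tolerance, rather than always running a fixed number of iterations.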


C  U      - VECTOR OF POTENTIAL VALUES FOR ODD TRAVERSES
C  V      - SAME FOR EVEN TRAVERSES
C  VCOEF  - MATRIX OF COEFFICIENTS FOR EVEN TRAVERSES
C  BCNDU  - VECTOR OF BOUNDARY VALUES FOR UCOEF VALUES
C  BCNDV  - SAME FOR VCOEF
C  TOP    - VECTOR OF BOUNDARY VALUES ACROSS TOP
C  BOT    - SAME FOR BOTTOM OF REGION
C  LFT    - SAME FOR LEFT HAND EDGE
C  RT     - SAME FOR RIGHT HAND EDGE
C  M      - NUMBER OF ROWS OF NODE POINTS
C  N      - NUMBER OF COLUMNS OF NODE POINTS
C  RHO    - ACCELERATION FACTOR
C
C  SET UP THE VECTORS AND MATRICES
C
      REAL UCOEF(500,3),VCOEF(500,3),BCNDU(500),BCNDV(500)
      REAL TOP(100),BOT(100),LFT(100),RT(100),U(500),V(500)
      INTEGER M,N,MSIZE,ML1,MSL1,NL1,I,J,K,ITMAX,KNT,L,JROW
C
C  INITIALIZE SOME VALUES WITH DATA STATEMENTS
C
      DATA UCOEF,VCOEF / 3000 * -1.0 /
      DATA M,N,RHO / 7,15,1.0 /
      DATA V / 500 * 0.0 /
      DATA RT,LFT,TOP,BOT / 100 * 100.0, 300 * 0.0 /
      DATA BCNDU,BCNDV / 1000 * 0.0 /
C
C  ESTABLISH THE COEFFICIENT MATRICES BY OVER-WRITING ON THE
C  DIAGONAL AND CERTAIN OFF DIAGONAL TERMS.
C
      MSIZE = M * N
      DO 10 I = 1,MSIZE
         UCOEF(I,2) = 1.0/RHO + 2.0
         VCOEF(I,2) = 1.0/RHO + 2.0
   10 CONTINUE
      ML1 = M - 1
      NL1 = N - 1
      MSL1 = MSIZE - 1
      DO 20 I = N,MSL1,N
         UCOEF(I,3) = 0.0
         UCOEF(I+1,1) = 0.0
   20 CONTINUE
      DO 30 I = M,MSL1,M
         VCOEF(I,3) = 0.0
         VCOEF(I+1,1) = 0.0
   30 CONTINUE
C
C  NOW GET VALUES INTO THE BCOND VECTORS
C
      DO 40 I = 1,N
         BCNDU(I) = TOP(I)
         J = MSIZE-N+I
         BCNDU(J) = BOT(I)
   40 CONTINUE
      DO 45 I = 1,M
         J = (I-1)*N + 1
         BCNDU(J) = BCNDU(J) + LFT(I)
         J = I*N
         BCNDU(J) = BCNDU(J) + RT(I)
   45 CONTINUE
      DO 50 I = 1,M
         BCNDV(I) = LFT(I)
         J = MSIZE - M + I
         BCNDV(J) = RT(I)
   50 CONTINUE
      DO 55 I = 1,N
         J = (I-1)*M + 1
         BCNDV(J) = BCNDV(J) + TOP(I)
         J = I*M
         BCNDV(J) = BCNDV(J) + BOT(I)
   55 CONTINUE
C
C  NOW WE GET THE LU DECOMPOSITIONS OF THE TWO COEFFICIENT MATRICES.
C  THESE ARE VERY EASY TO COMPUTE. WE STORE THEM BACK IN THE
C  ORIGINAL VECTOR SPACE.
C
      DO 60 I = 2,MSIZE
         UCOEF(I-1,3) = UCOEF(I-1,3)/UCOEF(I-1,2)
         UCOEF(I,2) = UCOEF(I,2) - UCOEF(I,1)*UCOEF(I-1,3)
         VCOEF(I-1,3) = VCOEF(I-1,3)/VCOEF(I-1,2)
         VCOEF(I,2) = VCOEF(I,2) - VCOEF(I,1)*VCOEF(I-1,3)
   60 CONTINUE
C
C  NOW WE BEGIN THE ITERATIONS, LIMIT THEM TO ITMAX IN NUMBER.
C
      ITMAX = 30
      DO 190 KNT = 2,ITMAX,2
C
C  COMPUTE R.H.S. FOR THE U EQUATIONS AND STORE THESE IN THE U
C  VECTOR. FIRST DO THE TOP AND BOTTOM SETS OF TERMS.
C
         DO 65 I = 1,N
            J = (I-1)*M + 1
            U(I) = (1.0/RHO-2.0)*V(J) + V(J+1) + BCNDU(I)
            K = MSIZE-N+I
            J = I*M
            U(K) = V(J-1) + (1.0/RHO-2.0)*V(J) + BCNDU(K)
   65    CONTINUE
C
C  NOW DO THE INTERMEDIATE ONES.
C
         DO 75 I = 2,ML1
            DO 70 J = 1,N
               K = (I-1)*N + J
               L = I + (J-1)*M
               U(K) = V(L-1) + (1.0/RHO-2.0)*V(L) + V(L+1) + BCNDU(K)
   70       CONTINUE
   75    CONTINUE
C
C  NOW GET THE SOLUTION FOR THE HORIZONTAL TRAVERSE, FIRST Y=L(-1)*B
C
         U(1) = U(1) / UCOEF(1,2)
         DO 80 I = 2,MSIZE
            U(I) = ( U(I) - UCOEF(I,1)*U(I-1) ) / UCOEF(I,2)
   80    CONTINUE
C
C  READY NOW TO GET X = U(-1)*Y.
C
         DO 90 JROW = MSL1,1,-1
            U(JROW) = U(JROW) - UCOEF(JROW,3)*U(JROW+1)
   90    CONTINUE
C
C  WE DO THE SAME FOR THE VERTICAL TRAVERSE - U AND V EXCHANGE
C  ROLES. COMPUTE R.H.S. FOR THE V EQUATIONS, STORE IN V. DO
C  THE TOP AND BOTTOM SETS.
C
         DO 95 I = 1,M
            J = (I-1)*N + 1
            V(I) = (1.0/RHO-2.0)*U(J) + U(J+1) + BCNDV(I)
            K = MSIZE - M + I
            J = I*N
            V(K) = U(J-1) + (1.0/RHO-2.0)*U(J) + BCNDV(K)
   95    CONTINUE
C
C  DO THE INTERMEDIATE ROWS
C
         DO 105 I = 2,NL1
            DO 100 J = 1,M
               K = (I-1)*M + J
               L = I + (J-1)*N
               V(K) = U(L-1) + (1.0/RHO-2.0)*U(L) + U(L+1) + BCNDV(K)
  100       CONTINUE
  105    CONTINUE
C
C  GET THE SOLUTION - FIRST Y = L(-1)*B
C
         V(1) = V(1) / VCOEF(1,2)
         DO 110 I = 2,MSIZE
            V(I) = ( V(I) - VCOEF(I,1)*V(I-1) ) / VCOEF(I,2)
  110    CONTINUE
C
C  THEN X = U(-1) * Y
C
         DO 120 JROW = MSL1,1,-1
            V(JROW) = V(JROW) - VCOEF(JROW,3)*V(JROW+1)
  120    CONTINUE
C
C  PRINT OUT THE LATEST RESULT.
C
         PRINT 202, KNT, (( V(I),I=J,MSIZE,M ), J=1,M)
  202    FORMAT(///1X,' AFTER ITERATION NUMBER ',I3/(1X,15F8.4) )
C
C  END OF THE ITERATION LOOP
C
  190 CONTINUE
      STOP
      END

Figure 7.25 Program 3. 
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EXERCISES 
Section 7.2 
1. In Section 7.2, it is shown that flow of heat is a phenomenon for which the rate of flow is 
equal to 

$$-kA \frac{\partial u}{\partial x},$$

where k is a constant, A is the cross-sectional area for heat flow, and $\partial u/\partial x$ is the temperature 
gradient. Suppose that k, instead of being constant, is a function of the position (x, y). Derive 
the equation for net flow of heat into the element dx dy. Does this reduce to Laplace's 
equation? 

2. Repeat Exercise 1, but for k varying with temperature, such as $k = a + bu + cu^2$. Does the 
equation for net heat flow now reduce to Laplace's equation? How does the equation differ 
from that in Exercise 1? 


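Section 7.3 
»3. The mixed second derivative $\partial^2 u/(\partial x\, \partial y)$ can be considered as 

$$\frac{\partial}{\partial x}\left(\frac{\partial u}{\partial y}\right) = \frac{\partial^2 u}{\partial x\, \partial y} = \frac{\partial}{\partial y}\left(\frac{\partial u}{\partial x}\right).$$

Show that, in terms of finite-difference quotients with $\Delta x = \Delta y = h$, this derivative can be 
approximated by the pictorial operator 

$$\frac{\partial^2 u}{\partial x\, \partial y} = \frac{1}{4h^2} \begin{Bmatrix} -1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & -1 \end{Bmatrix} u + O(h^2).$$

4. Given that 

$$\frac{\partial^2 u}{\partial x^2} = \frac{-u_{i+2} + 16u_{i+1} - 30u_i + 16u_{i-1} - u_{i-2}}{12h^2} + O(h^4),$$

find the fourth-order operator for the Laplacian. Assume that the function u has a continuous 
sixth derivative. 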


5. In some cases it is necessary to approximate a differential equation that contains the third 
derivative, d°u/ax*, by a difference equation. Set up an expression to approximate this third 
derivative in terms of finite differences. Can you find an expression that uses function values 
symmetrical to the point at which the derivative is being evaluated? (In finding the expressions, 
you may wish to compare the Taylor-series method of Section 7.3 with the methods of Chapter 
3 and the method of undetermined coefficients in the Appendix.) 


6. Suppose we have a rectangular plate with top and bottom edges that are perfectly insulated 
(du/ay = 0) while the right-hand edge is held at 100°, and the left edge at 0°. If heat flows 
in only two directions, it is obvious that the temperatures vary linearly in the x-direction, and 
are constant along vertical lines. 


a) Show that such a temperature distribution satisfies Eq. (7.6). 


b) Show that this temperature distribution also obeys the relationship derived in Exercise 4. 
What about points adjacent to the edges of the plate? 
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Section 7.4 
a ¢7. Set up equattes amslogoes © Eos. (7 §) for the example problem of Section 7.4. bat = 
2 gd spacing of 3] om Solve the st of equations by cimion 
S Solve for the scemty-stue temmperammes m 2 rectangular ple 12 by 15 mf ome 1S-m ck 
38 bel at 100" will the other ediecs mre afl bel at 2°. The menial is ede Take Ar 
= Sy = 3 me amd comsides beat © flow only m She itera rections. Skewch m the appro 
location of the SF isothermal carve. j 
9. Solve for the temperatures im the plane of Figere 726 when the oder tempersmees ee bel 
3s shown The plate is 10  S com amd Ar = Sy = 2m j 


10. Solve Exercise 7 using the nine-point formula of Eq. (7.6c). Assume that the temperatures 
at the corner points (20, 10) and (20, 0) are equal to 50°. 

»11. Solve Exercise 9 using the nine-point formula. 
12. Seppose the diferential equation for the pie were 
= + x, = 0 
What woeld Equation (7.60) and the pactorial operator (7.66) look Eke? 

13. The region for which we can solve Laplace's equation does not have to be rectangular. We 
can apply the methods of Section 7.4 for any region so long as the meshes of our network 
coincide with the boundary. Solve for the steady-state potentials at the eight interior points 
shown in Fig. 7.27. 


Q a 2 12 628 =O 


oe 
> 
8 


Figure 7.26 r . 




14. What fraction of the elements in the coefficient matrix of Exercise 8 are nonzero? What 
fraction if h = k = 1 in.? What fraction if h = k = 0.1 in.? How many equations must be 
solved in the latter case? Will your computer system handle that many coefficients in memory 
at the same time? 

Section 7.5 

»15. Repeat Exercise 7, but use Liebmann’s method. 

16. Solve Exercise 9 by Liebmann’s method with all the elements of the initial v vector equal 
to zero. Repeat with all the elements of the initial vector equal to 300, the upper bound to 
the steady-state temperatures. Repeat once more with the elements of the initial vector all 
equal to the arithmetic average of the boundary temperatures. Compare the number of iterations 
needed to reach a given tolerance for convergence in each case. 

17. Solve Exercise 8 by Liebmann’s method. Before you begin, make some quick hand com- 
putations to allow good estimates of starting values for the initial v vector. 

»18. Repeat Exercise 7, employing successive overrelaxation. Vary the overrelaxation factor to 
determine the optimum value. How does this compare to that predicted by Eq. (7.15)? 

19. Repeat Exercise 16, but use S.O.R. with the value of $\omega_{opt}$ as given by Eq. (7.15). 

20. Repeat Exercise 17, using S.O.R. with the value of $\omega_{opt}$ as given by Eq. (7.15). 

Section 7.6 

»21. Find the torsion function $\phi$ for a 2 x 2 in square bar. 
a) Subdivide the cross section into nine equal squares, so that there are four interior points. 
Because of symmetry, all four values of $\phi$ are the same. 
b) Then subdivide the cross section into 36 equal squares so that there are 25 interior points. 
Use the results of (a) to estimate starting values for iteration. 

22. Solve for the torsion in a hollow square bar, 5 in in outside dimension, and with walls 2 in 
thick (so that the inside hole is 1 in on a side). On the inner surface as well as on the outer, 
$\phi = 0$. 

23. Solve for the torsion in a prismatic bar of a cross section similar to the region of Exercise 
13. Assume the eight points are 0.5 cm apart. 

24. Solve $\nabla^2 u = f(x, y)$ on the square region bounded by 

x = 0, x = 1, y = 0, y = 1, 
with f(x, y) = xy. Use h = 1/4. Take u = 0 on the boundary. 
»25. Suppose the function defined on the rectangular plate, Fig. 7.7, were given as 
$3\phi_{xx} + \phi_{yy} + 2 = 0$, $\phi = 0$ on the boundary. 
What would the pictorial diagram and the set of equations in (7.17) and (7.18) look like if 
you were using the five-point method? 
»26. Repeat Exercise 21(b), but use S.O.R. Vary the overrelaxation factor to find the optimum. 
How does this compare to the value from Eq. (7.15)? 
27. Repeat Exercise 22, using S.O.R. Vary $\omega$ to find the optimum value. 
28. Repeat Exercise 24, using f(x, y) = (1 - x)(1 - y). Compare your results to those of Exercise 
24. 
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Section 7.7 


29. Given the pair of equations 

$$2x_1 + x_2 = 24,$$
$$x_1 + 2x_2 = -12;$$

a) Find B where 

$$B = (D + \omega L)^{-1}(D - \omega D - \omega U) = \begin{bmatrix} 2 & 0 \\ \omega & 2 \end{bmatrix}^{-1} \begin{bmatrix} 2(1 - \omega) & -\omega \\ 0 & 2(1 - \omega) \end{bmatrix}.$$

b) Using the arguments of Section 7.7 that $\lambda_1 = \lambda_2 = \omega - 1$ and the trace argument, show 
that $\omega_{opt} = 1.0718$. 
c) What is the largest-magnitude eigenvalue for the Gauss-Seidel matrix (for $\omega = 1$ in part 
(a))? 
d) Evaluate $x^{(1)}$, $x^{(2)}$, $x^{(3)}$, using both Gauss-Seidel and with the optimal $\omega = 1.0718$. In 
both cases, use the zero vector as the starting vector. The exact answer is {20, -16}. 

30. Show that the optimal value for $\omega$ is less than 1 for the example used in Sections 2.11 and 
6.7. Do this by writing a program that implements Eq. (7.21), using various values for $\omega$ 
between 0 and 2. (You should find the optimal value is between 0.9 and 1.0.) 


31. What are the eigenvalues for the iteration matrix in Problem 30 when w has the optimal 
value? 

Section 7.8 

32. Solve the set of equations in Eqs. (7.27), using (7.28) to eliminate u,, u,, u,, and uy. 

a) Use elimination. 
b) Use Liebmann’s method. 
c) Use S.O.R. 
»33. Solve the example problem of Section 7.8, except that the 8-cm edges are held at 20° and 
the 4-cm edges have an outward gradient of —15°C/cm. 

34. Suppose the outward normal gradient on all the edges of the example problem of Section 7.8 
is —15°C/cm. Can the problem be solved? 

35. Solve for the steady-state temperatures of the region in Exercise 13, except that the plate is
insulated at each exterior point marked zero. The edge temperatures are maintained at the 
temperatures given at the other exterior points. 

36. Consider a region that is obtained from Exercise 13 by reflecting the figure across both of 
the edges marked zero (see Fig. 7.28). Show that the steady-state potentials at points in the 
original region are the same as for Exercise 35. 

Section 7.9 

»37. Find the potential distribution in a region whose shape is a 3-4-5 right triangle. The potential
is maintained at zero except on the hypotenuse, where it is 50. Use a square mesh with h = 1.

38. A coaxial cable has a circular outer conductor 10 cm in diameter, and a square inner conductor
3 cm on a side. (The square conductor is concentric.) The outer conductor is at zero volts
while the inner is at 100 volts. Find the potential between the conductors. (Note that only
one octant needs to be calculated, because of symmetry.) Use a grid of 1-cm squares.




39. If a hollow shaft has an outer circular cross section of diameter 10 in and a concentric square 
inner cross section which is 3 in on a side, what is the value of the torsion function at the 
nodes of a 1-in grid? 

40. Find the solution to the Poisson equation over an equilateral triangle 5 in on a side if 

∇²u = k, u = 0 on boundary,
and values of k are as shown in Fig. 7.29.
»41. Use the Laplacian in polar coordinates to set up the set of difference equations to solve
∇²u = 0.2,
on a semicircular region whose radius is 4. Take Δr = 1 and Δθ = π/6. Boundary conditions
are u = 10 on the straight edge and u = 0 on the curved boundary.


Figure 7.28 [diagram: the region of Exercise 13 after reflection, with exterior points labeled 100]
Section 7.10

»42. A cube is 3 cm along its edge. Two opposite faces are held at 100°, the other four faces at
0°. Find the internal temperatures at the nodes of a 1-cm network. 
43. Repeat Exercise 42 except that one of the 0° faces has been insulated. 


Section 7.11 


»44. Solve Exercise 9 by the A.D.I. method, using ρ = 1.0. Begin with the initial u vector having
all elements equal to the arithmetic average of the boundary values. Compare the number of
iterations to those required with Liebmann's method (Exercise 16) and S.O.R. using ω_opt
(Exercise 19).

45. Solve the sets of equations in (7.27) and (7.28) by A.D.I. with ρ = 1.0. Compare the number
of iterations required with the number needed by S.O.R. as performed in Exercise 32(c).

46. Repeat Exercise 21, but use the A.D.I. method. Vary the value of ρ to find the optimum
value.

Section 7.12 
47. Find M⁻¹, 2A, c, N, and u(x, y) for these triangular elements:
x1 y1 x2 y2 x3 y3 u1 u2 u3 x y
a) 1.25 3.3 0.25 44 =2 =2 10 20 5 0 3 
b) 30 40 10 50 20 10 12.3 i) 222 23 34 
ce) 12.1 11.3 8.6 13.2 9:3) 79 121 215 67 7.7 9.8 


48. Solve Exercise 9 by finite elements. Locate four internal nodes at (4, 2), (8, 4), (2, 4), and 
(6, 6) relative to the lower left corner. Compare answers from this solution to the values
obtained through finite differences with grid spacing of 2 cm. 

49. Solve Exercise 13 by finite elements. Locate internal nodes at the points numbered 1, 4, and 
7. Compare the answers to those of Exercise 13. 

50. Use the finite-element method to solve for the torsion function within a bar whose cross 
section is an equilateral triangle. Each face of the bar is 3 in wide. If the internal nodes are 
placed symmetrically, symmetry in the problem can reduce the number of unknowns. Compare 
your solution to the analytical answer: 

_ (1/2)(x? + y?) + (1/2a)(x3 — 3xy?) — (2a?) 

¥ 27 


o 


where a is the distance from one face to the opposite vertex; x and y in the equation are 
relative to the centroid of the cross section.


APPLIED PROBLEMS AND PROJECTS 


51. Modify and use the program for Poisson's equation in Section 7.14 to solve 
∇²u = xy(x − 2)(y − 2),


on the region 0 < x < 2, 0 < y < 2, with u = 0 on all boundaries except for y = 0, where 
u= 1.0. 


52. Modify Program 2 so that it permits boundary conditions of the form

$$au + b\,\frac{\partial u}{\partial x} = c \quad\text{and}\quad a'u + b'\,\frac{\partial u}{\partial y} = c'.$$


Test it with the equation in Problem 51, modified in that the boundary where y = 0 has
∂u/∂y = 1.

53. Repeat Problem 52, except make the modifications to Program 3. You will need to make
changes to permit a Poisson-equation problem as well as derivative boundary conditions. 


54. A classic problem in elliptic partial-differential equations is to solve ∇²u = 0 on a region
defined by 0 < x < π, 0 < y < ∞, with boundary condition of u = 0 at x = 0, at x = π,
and at y = ∞. The boundary at y = 0 is held at u = F(x). This can be quite readily solved
by the method of separation of variables, to give the series solution

$$u = \sum_{n=1}^{\infty} B_n e^{-ny} \sin nx,$$

with

$$B_n = \frac{2}{\pi}\int_0^{\pi} F(x) \sin nx \, dx.$$

Solve this equation numerically for various definitions of F(x). (You will need to redefine
the region so that 0 < y < M, where M is large enough that changes in u with y at y = M
are negligible.) Compare your results to the series solution. You might try

F(x) = 100; F(x) = 100 sin x; F(x) = 400x(1 − x); F(x) = 100(1 − |2x − 1|).
55. The equation


is an elliptic equation. Solve it on the unit square, subject to u = 0 on the boundaries. 
Approximate the first derivative by a central-difference approximation. Investigate the effect 
of size of Ax on the results, to determine at what size reducing it does not have further effect. 


56. Solve the steady-state heat-flow problem that was discussed at the beginning of the chapter.
Assume that the metal is aluminum and that the insulation is mineral wool. Look up the 
thermal properties in handbooks. 

Solve it first as a three-dimensional problem. Then, considering that the plate is not very 
thick relative to its length and that aluminum is a good conductor so that there is little 
temperature variation across the thickness, solve it as a two-dimensional problem. Do you 
reach the same conclusion about the feasibility of the scheme? 

Consider carefully how you will handle the problem of dissimilar materials—metal and 
insulation. Can you approximate the effect of the insulation by some proper value of the 
gradient at the surface of the aluminum? 


Parabolic Partial-Differential Equations


8.0 CONTENTS OF THIS CHAPTER


The previous chapter discussed the solution of partial-differential equations that 
were time-independent. Such steady-state problems were described by elliptic 
equations. Unsteady-state problems in which the function is dependent on time 
are of great importance. We study these in this chapter. 

The method of finite-difference equations will be further developed in this 
and the following chapter, since it is very important in solving parabolic and 
hyperbolic partial-differential equations. In the realm of available software, one 
can use PDECOL, distributed by IMSL, for solving a system of parabolic or 
hyperbolic partial-differential equations as well as boundary-value problems. 
Finite-element programs can also solve these. 

The problem of unsteady-state flow of heat is a physical situation that can 
be represented by a parabolic partial-differential equation. The simplest situation 
is for flow of heat in one direction. Imagine a rod that is uniform in cross section 
and insulated around its perimeter so that heat flows only longitudinally. Consider 
a differential portion of the rod, dx in length with cross-sectional area A (see 
Fig. 8.1). 

We let u represent the temperature at any point in the rod, whose distance
from the left end is x. Heat is flowing from left to right under the influence of
the temperature gradient ∂u/∂x. Make a balance of the rate of heat flow into and
out of the element. Use k for thermal conductivity, cal/sec · cm² · (°C/cm), which
we assume is constant.

Figure 8.1 [diagram: rod of length L with an element of length dx and cross-sectional area A]


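$$\text{Rate of flow of heat in:}\quad -kA\,\frac{\partial u}{\partial x};$$

$$\text{Rate of flow of heat out:}\quad -kA\left(\frac{\partial u}{\partial x} + \frac{\partial}{\partial x}\left(\frac{\partial u}{\partial x}\right)dx\right).$$

The difference between the rate of flow in and the rate of flow out
is the rate at which heat is being stored in the element. If c is the heat capacity,
cal/g · °C, and ρ is the density, g/cm³, we have, with t for time,

$$-kA\,\frac{\partial u}{\partial x} - \left(-kA\left(\frac{\partial u}{\partial x} + \frac{\partial}{\partial x}\left(\frac{\partial u}{\partial x}\right)dx\right)\right) = c\rho(A\,dx)\,\frac{\partial u}{\partial t}.$$

Simplifying, we have

$$k\,\frac{\partial^2 u}{\partial x^2} = c\rho\,\frac{\partial u}{\partial t}.$$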


This is the basic mathematical model for unsteady-state flow. We have 
derived it for heat flow, but it applies equally to diffusion of material, flow of 
fluids (under conditions of laminar flow), flow of electricity in cables (the tele- 
graph equations), and so on. 

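The differential equation is classed as parabolic because, comparing it to the
standard form of an equation in two independent variables, x and t,

$$A\,\frac{\partial^2 u}{\partial x^2} + B\,\frac{\partial^2 u}{\partial x\,\partial t} + C\,\frac{\partial^2 u}{\partial t^2} + D\left(x, t, u, \frac{\partial u}{\partial x}, \frac{\partial u}{\partial t}\right) = 0,$$

we find that B² − 4AC = 0.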
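As a quick worked check of this classification (a sketch only, with the first-derivative term treated as part of D), write the one-dimensional heat-flow equation with everything on the left side and read off the coefficients of the second derivatives:

```latex
k\,\frac{\partial^2 u}{\partial x^2} - c\rho\,\frac{\partial u}{\partial t} = 0
\quad\Longrightarrow\quad
A = k,\; B = 0,\; C = 0,
\qquad
B^2 - 4AC = 0 - 4(k)(0) = 0 \quad\text{(parabolic)}.
```

For comparison, Laplace's equation in x and y has A = 1, B = 0, C = 1 (with y playing the role of the second variable), so B² − 4AC = −4 < 0, which is the elliptic case treated in the previous chapter.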
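In two or three space dimensions, analogous equations apply:

$$k\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) = c\rho\,\frac{\partial u}{\partial t}, \qquad k\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2}\right) = c\rho\,\frac{\partial u}{\partial t}.$$

The function that we call the solution to the problem not only must obey
the differential equation given above, but also must satisfy an initial condition
and a set of boundary conditions. For the one-dimensional heat-flow problem
that we first consider, the initial condition will be the initial temperatures at all
points along the rod,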


$$u(x, t)\big|_{t=0} = u(x, 0) = f(x).$$


The boundary conditions will describe the temperature at each end of the 
rod as functions of time. Our first examples will consider the case where these 
temperatures are held constant: 


$$u(0, t) = c_1,$$
$$u(L, t) = c_2.$$


More general (and more practical) boundary conditions will involve not only the 
temperature but the temperature gradients, and these may vary with time: 


$$A_1 u(0, t) + B_1\,\frac{\partial u(0, t)}{\partial x} = F_1(t),$$

$$A_2 u(L, t) + B_2\,\frac{\partial u(L, t)}{\partial x} = F_2(t).$$


In Chapter 8, you will study the application of finite differences to solve parabolic 
partial-differential equations. We begin with problems of one space dimension. 


8.1 THE EXPLICIT METHOD 
Divides space and time into discrete uniform subintervals and replaces both time 
and space derivatives by finite-difference approximations, permitting one to easily 
compute values of the function at a time Δt after the initial time. These values are
then used to compute a second set of values and the process is repeated. 

8.2 CRANK-NICOLSON METHOD 
Overcomes the limitations that the explicit method places on the size of Δt and
gives improved accuracy at the expense of having to solve a set of equations at 
each time step. Fortunately, the system is tridiagonal for a one-space-dimension 
problem. 

8.3 DERIVATIVE BOUNDARY CONDITIONS 
Really impose no new difficulties because we can extend the region artificially and 
replace the derivatives at the boundaries also by finite differences. 

8.4 STABILITY AND CONVERGENCE CRITERIA 
Shows you why there were limitations on the size of Δt in the explicit method and
none with implicit methods. 
8.5 PARABOLIC EQUATIONS IN TWO OR MORE DIMENSIONS
Extends finite-difference methods to regions of more than one dimension; we dis- 
cover that the number of equations to be solved in implicit methods grows tremen- 
dously, and they are not tridiagonal. A modification, the A.D.I. method, really 
helps the situation. 

8.6 FINITE ELEMENTS FOR HEAT FLOW 
Demonstrates that this newer procedure readily applies, though it leads into many 
simple but tedious steps to set up a system of linear equations that give the solution 
to the problem. 

8.7 CHAPTER SUMMARY 
Helps you to review your knowledge of the important topics of the chapter. 

8.8 COMPUTER PROGRAMS 
Illustrates how FORTRAN programs can solve parabolic partial-differential equa- 
tions by finite differences. 


8.1 THE EXPLICIT METHOD


Our approach to solving parabolic partial-differential equations by a numerical method 
is to replace the partial derivatives by finite-difference approximations. For the one- 
dimensional heat-flow equation, 


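$$\frac{\partial^2 u}{\partial x^2} = \frac{c\rho}{k}\,\frac{\partial u}{\partial t}, \tag{8.1}$$

we can use the relations

$$\left.\frac{\partial^2 u}{\partial x^2}\right|_{x = x_i,\; t = t_j} = \frac{u_{i+1}^{j} - 2u_i^{j} + u_{i-1}^{j}}{(\Delta x)^2} + O(\Delta x)^2 \tag{8.2}$$

and

$$\left.\frac{\partial u}{\partial t}\right|_{x = x_i,\; t = t_j} = \frac{u_i^{j+1} - u_i^{j}}{\Delta t} + O(\Delta t). \tag{8.3}$$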


We use subscripts to denote position and superscripts for time. Note that the error 
terms are of different orders since a forward difference is used in Eq. (8.3). This introduces 
some special limitations, but when this is done, the procedure is simplified. 
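Substituting Eqs. (8.2) and (8.3) into (8.1) and solving for $u_i^{j+1}$ gives the equation
for the forward-difference method:

$$u_i^{j+1} = \frac{k\,\Delta t}{c\rho(\Delta x)^2}\left(u_{i+1}^{j} + u_{i-1}^{j}\right) + \left(1 - \frac{2k\,\Delta t}{c\rho(\Delta x)^2}\right)u_i^{j}. \tag{8.4}$$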

We have solved for $u_i^{j+1}$ in terms of the temperatures at time $t_j$ in Eq. (8.4) in view
of the normally known conditions for a parabolic partial-differential equation. We
subdivide the length into uniform subintervals and apply our finite-difference approximation
to Eq. (8.1) at each point where u is not known. Equation (8.4) then gives the values of
u at each interior point at t = t₁, since the values at t = t₀ are given by the initial conditions.
It can then be used to get values at t₂, using the values at t₁ as initial conditions, so we
can step the solution forward in time. At the endpoints, the boundary conditions will
determine u.
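The stepping procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this sketch, `r` stands for the ratio kΔt/cρ(Δx)² that multiplies the neighbor terms in Eq. (8.4), and the first and last entries of each list are boundary values that are simply carried forward unchanged.

```python
def explicit_step(u, r):
    """Advance one time step with Eq. (8.4); u[0] and u[-1] are boundary values."""
    new = u[:]  # the copy keeps the two boundary values as they are
    for i in range(1, len(u) - 1):
        new[i] = r * (u[i + 1] + u[i - 1]) + (1.0 - 2.0 * r) * u[i]
    return new


def explicit_march(u0, r, steps):
    """Return the profiles at t0, t1, ..., each computed from the one before."""
    history = [list(u0)]
    for _ in range(steps):
        history.append(explicit_step(history[-1], r))
    return history


# Hypothetical data: ends held at 0 and one warm interior point.
profiles = explicit_march([0.0, 0.0, 100.0, 0.0, 0.0], r=0.25, steps=2)
for row in profiles:
    print(row)
```

Each new row depends only on the row before it, which is why no system of equations has to be solved.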

The relative size of the time and distance steps, Δt and Δx, affects Eq. (8.4). If the
ratio Δt/(Δx)² is chosen so that kΔt/cρ(Δx)² = 1/2, the equation is simplified in that
the last term vanishes and we have

$$u_i^{j+1} = \frac{1}{2}\left(u_{i+1}^{j} + u_{i-1}^{j}\right). \tag{8.5}$$

If the value of kΔt/cρ(Δx)² is chosen as less than one-half, there will be improved
accuracy* (limited, of course, by the errors dependent on the size of Δx). If the value
is chosen greater than one-half, which would reduce the number of calculations required
to advance the solution through a given interval of time, the phenomenon of instability
sets in. (We show later in this chapter why this ratio affects stability and convergence.)

We illustrate the method with a simple example, varying the value of kΔt/cρ(Δx)²
to demonstrate its effect.


EXAMPLE

A large flat steel plate is 2 cm thick. If the initial temperatures (°C) within the plate are
given, as a function of the distance from one face, by the equations

$$u = 100x \quad\text{for } 0 \le x \le 1,$$
$$u = 100(2 - x) \quad\text{for } 1 \le x \le 2,$$

find the temperatures as a function of x and t if both faces are maintained at 0°C.


*It can be shown that choosing the ratio equal to 1/6 is especially advantageous in minimizing the truncation
error.




Since the plate is large, we can neglect lateral flow of heat relative to the flow
perpendicular to the faces and hence use Eq. (8.1) for heat flow in one direction. For
steel, k = 0.13 cal/sec · cm · °C, c = 0.11 cal/g · °C, and ρ = 7.8 g/cm³. In order to
use Eq. (8.5) as an approximation to the physical problem, we subdivide the total thickness
into an integral number of spaces. Let us use Δx = 0.25, giving eight subdivisions. If
we wish to use Eq. (8.5), we then fix Δt by the relation

$$\frac{k\,\Delta t}{c\rho(\Delta x)^2} = \frac{1}{2}, \qquad \Delta t = \frac{(0.11)(7.8)(0.25)^2}{(2)(0.13)} = 0.206 \text{ sec}.$$
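The arithmetic that fixes the time step is easy to check by machine. The helper below is a hypothetical convenience (its name and argument names are not from the text); it just solves kΔt/cρ(Δx)² = ratio for Δt using the steel properties quoted above.

```python
def max_stable_dt(k, c, rho, dx, ratio=0.5):
    """Time step for which k*dt/(c*rho*dx**2) equals the chosen ratio."""
    return ratio * c * rho * dx ** 2 / k


# Steel plate of the example: k = 0.13, c = 0.11, rho = 7.8, dx = 0.25 (cgs units)
dt_steel = max_stable_dt(k=0.13, c=0.11, rho=7.8, dx=0.25)
print(round(dt_steel, 3))
```

Passing a smaller `ratio` gives the correspondingly smaller time step for the same Δx.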


The boundary conditions are

$$u(0, t) = 0, \qquad u(2, t) = 0.$$

The initial condition is

$$u(x, 0) = 100x \quad\text{for } 0 \le x \le 1,$$
$$u(x, 0) = 100(2 - x) \quad\text{for } 1 \le x \le 2.$$


Our calculations are conveniently recorded in a table, as in Table 8.1, where each
row of figures is at a particular time. We begin by filling in the initial conditions along
the first row, at t = 0.0. The simple algorithm of Eq. (8.5) tells us that, at each interior
point, the temperature at any point at the end of a time step is just the arithmetic average
of the temperatures at the adjacent points at the beginning of that time step. The end
temperatures are given by the boundary conditions. Because the temperatures are
symmetrical on either side of the center line, we compute only for x ≤ 1.0. The temperature
at x = 1.25 is the same as at x = 0.75.


Table 8.1  Numerical solution to heat-flow example

                     Calculated temperatures at

  t       x = 0   x = 0.25   x = 0.50   x = 0.75   x = 1.0   x = 1.25

0.0         0      25.0       50.0       75.0      100.0      75.0
0.206       0      25.0       50.0       75.0       75.0      75.0
0.412       0      25.0       50.0       62.5       75.0      62.5
0.619       0      25.0       43.75      62.5       62.5      62.5
0.825       0      21.88      43.75      53.12      62.5      53.12
1.031       0      21.88      37.5       53.12      53.12     53.12
1.238       0      18.75      37.5       45.31      53.12     45.31
1.444       0      18.75      32.03      45.31      45.31     45.31
1.650       0      16.02      32.03      38.67      45.31     38.67
1.856       0      16.02      27.34      38.67      38.67     38.67
2.062       0      13.67      27.34      33.01      38.67     33.01
Figure 8.2 [plot: temperature, u, versus number of time steps; points are the numerical values, curves the analytical solution]

In Fig. 8.2, we compare some of the numerical results with the analytical solution,
which is

$$u = 800 \sum_{n=0}^{\infty} \frac{1}{\pi^2(2n + 1)^2}\,\cos\frac{(2n + 1)\pi(x - 1)}{2}\; e^{-0.3738(2n + 1)^2 t}.$$

Note that the numerical values are close to and oscillate about the curves that are 
drawn to represent the analytical solution. In general the errors of Table 8.1 are less than 
4%. If the size of Ax were less, the errors would be smaller. Unfortunately, this easy 
method is not always so accurate. 

When the value for kΔt/cρ(Δx)² in Eq. (8.4) is other than 0.5, the resulting equation
is slightly more complicated, but the related computer program is not significantly slower.
The major effect is the size of the time steps; reducing the value requires more successive
calculations to reach a given time after the start of heat flow. Increasing the ratio to a
value greater than 0.5 would decrease the number of successive calculations and hence
reduce the computer time needed to solve a given problem. An extremely important
phenomenon occurs, however, when the ratio is >0.5; as illustrated in Fig. 8.3 where
temperature profiles are shown at two different times, t = 0.99 sec and t = 1.98 sec,
very inaccurate results may occur. The curves show the solution from the infinite series.
The open circles, computed with the ratio r = kΔt/cρ(Δx)² = 0.6, show extreme oscillation:
The calculated values are very imprecise, and completely impossible behavior is represented.
The data for r = 0.4 are smooth and almost exactly match the series solution.

Figure 8.3 [plot: temperature profiles at t = 0.99 sec and t = 1.98 sec; curves are the series solution, with computed points for r = 0.6 (open circles) and r = 0.4]

The reason for this great difference in behavior is that the choice of the ratio
kΔt/cρ(Δx)² = 0.6 introduces instability in which errors grow at an accelerating rate as
time increases. (At values of t > 2, negative values of u are computed, a patently
impossible situation.) As indicated in Fig. 8.3, when this ratio is taken at 0.4, the results are
excellent.
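The contrast between the two ratios is easy to reproduce numerically. The following is a minimal sketch, not the program used for the figure: it applies the explicit formula of Eq. (8.4) to the initial temperatures of the steel-plate example (Δx = 0.25) for 40 steps, once with ratio 0.4 and once with 0.6. The function name and the step count are choices made here.

```python
def march(u, r, steps):
    """Apply the explicit formula `steps` times; the end values are held fixed."""
    for _ in range(steps):
        u = ([u[0]]
             + [r * (u[i + 1] + u[i - 1]) + (1 - 2 * r) * u[i]
                for i in range(1, len(u) - 1)]
             + [u[-1]])
    return u


# Initial temperatures at x = 0, 0.25, ..., 2.0
start = [0, 25, 50, 75, 100, 75, 50, 25, 0]

stable = march(start, 0.4, 40)
unstable = march(start, 0.6, 40)
print(min(stable), max(stable))
print(min(unstable), max(unstable))
```

With the ratio at 0.4 every new value is a weighted average with nonnegative weights, so the temperatures stay between 0 and 100. With 0.6 the weight on the old value is negative, and the run ends with large values of both signs.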

In Fig. 8.2 with kΔt/cρ(Δx)² = 0.5, the limiting value to avoid instability was
used. Even here, the oscillation of points around the curves indicates its borderline value,
and the accuracy is hardly acceptable. The phenomena of stability and convergence, more
fully discussed in a later section of this chapter, set a maximum value of 0.5 to the ratio
kΔt/cρ(Δx)². When derivative boundary conditions are involved, the ratio must be less
than 0.5. Discontinuities in the initial conditions (even a discontinuity in the gradient,
∂u/∂x, at t = 0) cause the accuracy to be poor at the maximum value. The next example
will demonstrate this.

We will now solve a problem in diffusion, which is governed by the same mathe- 
matical equation as is heat conduction in a solid. 


EXAMPLE

A hollow tube 20 cm long is initially filled with air containing 2% of ethyl alcohol vapors.
At the bottom of the tube is a pool of alcohol that evaporates into the stagnant gas above. 
(Heat transfers to the alcohol from the surroundings to maintain a constant temperature 
of 30°C, at which temperature the vapor pressure is 0.1 atm.) At the upper end of the 
tube, the alcohol vapors dissipate to the outside air, so the concentration is essentially 
zero. Considering only the effects of molecular diffusion, determine the concentration of 
alcohol as a function of time and the distance x measured from the top of the tube. 
Molecular diffusion follows the law

$$D\,\frac{\partial^2 c}{\partial x^2} = \frac{\partial c}{\partial t},$$

where D is the diffusion coefficient, with units of cm²/sec in the cgs system. (This is the
same as for the ratio k/cρ, which is often termed thermal diffusivity.) For ethyl alcohol,
D = 0.119 cm²/sec at 30°C, and the vapor pressure is such that 10 volume percent
alcohol in air is present at the surface.

Subdivide the length of the tube into five intervals, so Δx = 4 cm. Using the maximum
value permitted for Δt yields

$$\frac{D\,\Delta t}{(\Delta x)^2} = \frac{1}{2} = \frac{0.119\,\Delta t}{(4)^2}, \qquad \Delta t = 67.2 \text{ sec}.$$

Our initial condition is c(x, 0) = 2.0. The boundary conditions are

$$c(0, t) = 0.0, \qquad c(20, t) = 10.0.$$

The concentrations are measured by the percent of alcohol vapor in the air. 

The computations are shown in Table 8.2. Again, using Eq. (8.5), each interior 
value of c is given by the arithmetic average of concentrations on either side in the line
above. A little reflection is required to determine the proper concentrations to be used 
for x = 0 and x = 20 at t = 0. While these are initially at 2%, we assume they are 
changed instantaneously to 0% and 10% because the effective concentrations acting 
during the first time interval are not the initial values but the changed values. We have 
accordingly rewritten these values as shown in the first row of Table 8.2. 


Table 8.2  Diffusion of alcohol vapors in a tube: solution of the equation $u_i^{j+1} = \frac{1}{2}(u_{i+1}^{j} + u_{i-1}^{j})$

                     Concentration of alcohol at

Time,
sec       x = 0   x = 4   x = 8   x = 12   x = 16   x = 20

  0         0.0   2.0     2.0     2.0      2.0      10.0
 67.2       0.0   1.00    2.00    2.00     6.00     10.0
134.4       0.0   1.00    1.50    4.00     6.00     10.0
201.6       0.0   0.75    2.50    3.75     7.00     10.0
268.8       0.0   1.25    2.25    4.75     6.875    10.0
336.0       0.0   1.125   3.00    4.562    7.375    10.0
403.2       0.0   1.500   2.844   5.188    7.281    10.0
470.4       0.0   1.422   3.344   5.062    7.594    10.0
537.6       0.0   1.672   3.242   5.469    7.531    10.0
604.8       0.0   1.621   3.570   5.386    7.734    10.0
Steady
state       0.0   2.0     4.0     6.0      8.0      10.0


Figure 8.4 [plot: concentration, u%, versus number of time steps, comparing the calculated values with the analytical series solution]


As time passes, the concentration will become a linear function of x, from 10% at
x = 20 to 0% at x = 0. The calculated values approach the steady-state concentrations
as time passes. However, as Fig. 8.4 shows, the successive calculated values oscillate
about the values calculated from the analytical solution.*

While the algorithm of Eq. (8.5) is simple to use, the results leave something to be
desired in the way of accuracy. The oscillatory nature of the results, when a smooth trend
of the concentration is expected, is also of concern. We can improve accuracy by using
smaller steps, but if we decrease Δx, the time steps must also decrease, because the
ratio DΔt/(Δx)² cannot exceed 1/2. Cutting Δx in half will require making Δt one-fourth
of its previous value, giving a total of eight times as many calculations. With the mixed
order of error terms, it is not obvious what reduction of error this would give, but the errors
should be reduced about fourfold.
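The factor of eight follows from a simple count, sketched below. The work measure used here (number of space intervals times number of time steps) is an assumption made for illustration, as are the function and argument names; the numbers are those of the tube example.

```python
def explicit_work(length, dx, t_end, diffusivity, ratio=0.5):
    """Rough work estimate: (space intervals) x (time steps) at the given ratio."""
    dt = ratio * dx ** 2 / diffusivity  # time step fixed by D*dt/dx**2 = ratio
    return (length / dx) * (t_end / dt)


coarse = explicit_work(length=20.0, dx=4.0, t_end=604.8, diffusivity=0.119)
fine = explicit_work(length=20.0, dx=2.0, t_end=604.8, diffusivity=0.119)
print(fine / coarse)
```

Halving Δx doubles the number of intervals and, because Δt scales with (Δx)², quadruples the number of time steps, so the estimate grows by 2 × 4 = 8.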


*In this instance the analytical solution is given by the infinite series shown in Fig. 8.4.

We can reduce Δt without change in Δx, however, because the method is stable for
any value of the ratio less than 1/2. Suppose we make DΔt/(Δx)² = 1/4. The basic difference
equation, (8.4), now becomes

$$u_i^{j+1} = \frac{1}{4}\left(u_{i+1}^{j} + u_{i-1}^{j}\right) + \frac{1}{2}\,u_i^{j}. \tag{8.6}$$


Table 8.3 summarizes the use of Eq. (8.6) on the same example as before. As shown 
more clearly by Fig. 8.5, this last calculation causes the calculated concentrations to 
follow smooth curves, and the error is reduced considerably. The poorest accuracy is 
near the beginning. The later values are close to the curve. 

This initial poor accuracy is a result of the discontinuities in the boundary conditions. 
The abrupt change of concentrations at the ends of the tube makes the explicit method 
inaccurate. In a situation with continuity of the initial values of the potential plus continuity 
of its derivatives and the absence of discontinuities in the boundary conditions, one would 
get better accuracy throughout. 
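One way to put a number on "oscillatory" versus "smooth" is to count how often the computed concentration at a fixed point reverses direction from one time step to the next. The sketch below does this at x = 8 for the tube example, once with ratio 1/2 and once with ratio 1/4 over the same time span. The reversal count is a diagnostic chosen here for illustration, not something taken from the text, and the function names are hypothetical.

```python
def march_history(u, r, steps):
    """Return the profiles at every time level, using Eq. (8.4) with ratio r."""
    rows = [list(u)]
    for _ in range(steps):
        p = rows[-1]
        rows.append([p[0]]
                    + [r * (p[i + 1] + p[i - 1]) + (1 - 2 * r) * p[i]
                       for i in range(1, len(p) - 1)]
                    + [p[-1]])
    return rows


def sign_changes(values):
    """Count reversals of direction in a sequence (flat steps are ignored)."""
    diffs = [b - a for a, b in zip(values, values[1:]) if b != a]
    return sum(1 for a, b in zip(diffs, diffs[1:]) if a * b < 0)


start = [0.0, 2.0, 2.0, 2.0, 2.0, 10.0]   # x = 0, 4, ..., 20 with effective end values
half = march_history(start, 0.5, 9)        # 9 steps of 67.2 sec
quarter = march_history(start, 0.25, 18)   # 18 steps of 33.6 sec, same time span

x8_half = [row[2] for row in half]
x8_quarter = [row[2] for row in quarter]
print(sign_changes(x8_half), sign_changes(x8_quarter))
```

With ratio 1/2 the value at x = 8 zigzags at nearly every step, while with ratio 1/4 it dips once and then rises steadily toward its steady-state value.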


Table 8.3  Diffusion of alcohol vapors in a tube: $u_i^{j+1} = \frac{1}{4}(u_{i+1}^{j} + u_{i-1}^{j}) + \frac{1}{2}u_i^{j}$

                     Concentration of alcohol at

Time,
sec       x = 0   x = 4   x = 8   x = 12   x = 16   x = 20

  0.0       0.0   2.0     2.0     2.0      2.0      10.0
 33.6       0.0   1.50    2.00    2.00     4.00     10.0
 67.2       0.0   1.25    1.875   2.50     5.00     10.0
100.8       0.0   1.094   1.875   2.969    5.625    10.0
134.4       0.0   1.016   1.953   3.360    6.055    10.0
168.0       0.0   0.996   2.070   3.682    6.368    10.0
201.6       0.0   1.016   2.204   3.950    6.604    10.0
235.2       0.0   1.059   2.343   4.177    6.790    10.0
268.8       0.0   1.115   2.480   4.372    6.939    10.0
302.4       0.0   1.178   2.612   4.541    7.062    10.0
336.0       0.0   1.242   2.736   4.689    7.166    10.0
369.6       0.0   1.305   2.850   4.820    7.255    10.0
403.2       0.0   1.365   2.956   4.936    7.332    10.0
436.8       0.0   1.422   3.053   5.040    7.400    10.0
470.4       0.0   1.474   3.142   5.133    7.460    10.0
504.0       0.0   1.523   3.223   5.217    7.513    10.0
537.6       0.0   1.567   3.296   5.292    7.561    10.0
Steady
state       0.0   2.0     4.0     6.0      8.0      10.0

Figure 8.5 
The method presented in this section is called the explicit method because each new
value of the potential can be immediately calculated from quantities that are already
known. It is simple and economical of calculation effort but has a severely limited upper
value for the ratio kΔt/cρ(Δx)². We can remove this limitation by going to an implicit
method.


8.2 CRANK-NICOLSON METHOD


When the difference equation, Eq. (8.4), was derived, we noted that a mixed order of
error was involved because a forward difference was used to replace the time derivative
while a central difference was used for the distance derivative. However, the difference
quotient, $(u_i^{j+1} - u_i^{j})/\Delta t$, can be considered a central difference if we take it as
corresponding to the midpoint of the time interval. Suppose we do consider $(u_i^{j+1} - u_i^{j})/\Delta t$
as a central-difference approximation to ∂u/∂t, and equate it to a central-difference quotient
for the second derivative with distance, also corresponding to the midpoint in time, by
averaging difference quotients at the beginning and end of the time step. Then
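$$\frac{\partial^2 u}{\partial x^2} = \frac{c\rho}{k}\,\frac{\partial u}{\partial t}$$

is approximated by

$$\frac{1}{2}\Bigg(\underbrace{\frac{u_{i+1}^{j} - 2u_i^{j} + u_{i-1}^{j}}{(\Delta x)^2}}_{\text{central difference at } t_j} + \underbrace{\frac{u_{i+1}^{j+1} - 2u_i^{j+1} + u_{i-1}^{j+1}}{(\Delta x)^2}}_{\text{central difference at } t_{j+1}}\Bigg) = \frac{c\rho}{k}\,\underbrace{\frac{u_i^{j+1} - u_i^{j}}{\Delta t}}_{\text{central difference at } t_{j+1/2}}.$$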

When this is rearranged we get the Crank-Nicolson formula, where r = kΔt/cρ(Δx)²:

$$-r\,u_{i-1}^{j+1} + (2 + 2r)\,u_i^{j+1} - r\,u_{i+1}^{j+1} = r\,u_{i-1}^{j} + (2 - 2r)\,u_i^{j} + r\,u_{i+1}^{j}.$$

Letting r = 1, we get some simplification:

$$-u_{i-1}^{j+1} + 4u_i^{j+1} - u_{i+1}^{j+1} = u_{i-1}^{j} + u_{i+1}^{j}.$$

As we will discuss in the next section, one advantage of the Crank-Nicolson method
is that it is stable for any value of r, although small values are more accurate.

Equation (8.7) is the usual formula for using the Crank-Nicolson method. Note that
the new temperature $u_i^{j+1}$ is not given directly in terms of known temperatures one time
step earlier, but is a function of unknown temperatures at adjacent positions as well. It
is therefore termed an implicit method in contrast to the explicit method of the previous
section. With implicit methods, the values of u at t = t₁ are not just a function of values
at t = t₀, but also involve the other u values at the same time step. This requires us to
solve a set of simultaneous equations at each time step.
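A single Crank-Nicolson step can be sketched as follows. This is an illustrative implementation under stated assumptions, not the book's program: it takes the general-r form of the formula (unknowns at the new time on the left, known values on the right), moves the known new-time boundary values to the right side, and solves the resulting tridiagonal system with a standard elimination (Thomas algorithm) without pivoting. All function and variable names are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[i] multiplies x[i-1], upper[i] multiplies x[i+1]."""
    n = len(diag)
    c = [0.0] * n   # modified super-diagonal
    d = [0.0] * n   # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def crank_nicolson_step(u, r, left_new, right_new):
    """One step; u holds the old-time values including both boundary values."""
    m = len(u) - 2  # number of interior unknowns
    rhs = [r * u[i - 1] + (2 - 2 * r) * u[i] + r * u[i + 1] for i in range(1, m + 1)]
    rhs[0] += r * left_new      # known new-time boundary terms go to the right side
    rhs[-1] += r * right_new
    lower = [0.0] + [-r] * (m - 1)
    upper = [-r] * (m - 1) + [0.0]
    diag = [2 + 2 * r] * m
    interior = solve_tridiagonal(lower, diag, upper, rhs)
    return [left_new] + interior + [right_new]


# Sanity check on made-up data: a linear profile is a steady state and should not change.
u_old = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
u_new = crank_nicolson_step(u_old, r=1.0, left_new=0.0, right_new=10.0)
print(u_new)
```

Unlike the explicit formula, every interior value at the new time comes out of one coupled solve, which is the extra work the implicit method trades for its freedom in choosing r.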


EXAMPLE

We illustrate the method by solving the problem of diffusion of alcohol vapors, the same
as that previously attacked by the explicit method. Again, let us take Δx = 4 cm and
take r = 1. Restating the problem, we have

$$\frac{\partial^2 u}{\partial x^2} = \frac{1}{D}\,\frac{\partial u}{\partial t}, \qquad u(0, t) = 0.0, \quad u(20, t) = 10.0, \quad u(x, 0) = 2.0, \quad D = 0.119 \text{ cm}^2/\text{sec}.$$


If DΔt/(Δx)² = r = 1 and Δx = 4 cm, then Δt = 134.4 sec. For t = 134.4, at the end
of the first time step, we write the equations at each point whose concentration is unknown.
The effective concentrations at the two ends, 0% and 10%, respectively, are again used:

$$-0.0 + 4u_1 - u_2 = 0.0 + 2.0,$$
$$-u_1 + 4u_2 - u_3 = 2.0 + 2.0,$$
$$-u_2 + 4u_3 - u_4 = 2.0 + 2.0,$$
$$-u_3 + 4u_4 - 10.0 = 2.0 + 10.0.$$


This method requires more work because, at each time step, we must solve a set of 
equations similar to the above. Fortunately, the system is tridiagonal. Since we will need 
to repeatedly solve the system with the same coefficient matrix (we transpose the constant 
values to the right side in the first and last equations), it is desirable to use an LU method. 
We then solve by back-substitutions, using the elements of L and U: 


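$$L = \begin{bmatrix} 4 & 0 & 0 & 0 \\ -1 & 3.75 & 0 & 0 \\ 0 & -1 & 3.733 & 0 \\ 0 & 0 & -1 & 3.732 \end{bmatrix}, \qquad U = \begin{bmatrix} 1 & -0.25 & 0 & 0 \\ 0 & 1 & -0.267 & 0 \\ 0 & 0 & 1 & -0.268 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$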

Remember that, to solve Ax = b = LUx, we first solve Ly = b (which requires no
reduction of L because it is already triangular) and then we solve Ux = y by back-
substitution. Observe that when A is tridiagonal, its zeros are preserved in L and U, and,
in fact, we need to compute only 2(n − 1) new values to get the LU equivalent of the
n × n tridiagonal matrix A. Not only is the computational effort minimized, but we
can also minimize computer memory space by storing only the three nonzero values in
each row. In fact, as discussed in Chapter 2, we can store the elements of both L and U
in place of the original values in A.
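The factor-once, solve-many idea can be sketched for a tridiagonal matrix as below. The code assumes the arrangement with ones on the diagonal of U and stores only the nonzero bands; function names and the storage layout are illustrative choices, and no pivoting is attempted.

```python
def crout_tridiagonal(sub, diag, sup):
    """Factor a tridiagonal matrix as L*U with ones on the diagonal of U.

    sub[i] multiplies x[i-1] in row i (sub[0] is unused);
    sup[i] multiplies x[i+1] in row i (sup[-1] is unused).
    Returns (l_diag, u_sup); the sub-diagonal of L is the same as `sub`.
    """
    n = len(diag)
    l_diag = [0.0] * n
    u_sup = [0.0] * (n - 1)
    l_diag[0] = diag[0]
    for i in range(n - 1):
        u_sup[i] = sup[i] / l_diag[i]
        l_diag[i + 1] = diag[i + 1] - sub[i + 1] * u_sup[i]
    return l_diag, u_sup


def lu_solve(sub, l_diag, u_sup, b):
    """Solve L*y = b by forward substitution, then U*x = y by back-substitution."""
    n = len(b)
    y = [0.0] * n
    y[0] = b[0] / l_diag[0]
    for i in range(1, n):
        y[i] = (b[i] - sub[i] * y[i - 1]) / l_diag[i]
    x = y[:]
    for i in range(n - 2, -1, -1):
        x[i] = y[i] - u_sup[i] * x[i + 1]
    return x


# Coefficient matrix of the r = 1 Crank-Nicolson equations with four unknowns
sub = [0.0, -1.0, -1.0, -1.0]
diag = [4.0, 4.0, 4.0, 4.0]
sup = [-1.0, -1.0, -1.0, 0.0]
l_diag, u_sup = crout_tridiagonal(sub, diag, sup)
print([round(v, 3) for v in l_diag])
print([round(v, 3) for v in u_sup])

x = lu_solve(sub, l_diag, u_sup, [2.0, 4.0, 4.0, 22.0])
```

Only `l_diag[1:]` and `u_sup` are newly computed, 2(n − 1) numbers in all; at later time steps `lu_solve` is called again with a new right-hand side and the same factors.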


Table 8.4  Diffusion of alcohol vapors in a tube: Crank-Nicolson method

                     Calculated concentration values at

Time,
sec       x = 0   x = 4   x = 8   x = 12   x = 16   x = 20

  0.0       0.0   2.0     2.0     2.0      2.0      10.0
134.4       0.0   0.980   2.019   3.072    5.992    10.0
268.8       0.0   1.070   2.363   4.305    6.555    10.0
403.2       0.0   1.276   2.861   4.762    6.962    10.0
537.6       0.0   1.471   3.165   5.115    7.159    10.0

                     Analytical values

Time,
sec       x = 4   x = 12

134.4     1.078   3.191
268.8     1.108   4.272
403.2     1.340   4.873
In Table 8.4 the results of calculations by the Crank-Nicolson method are listed.
For the first line of calculations the b-vector has components 2.0, 4.0, 4.0, and 22.0,
and the calculated values are the components of A⁻¹b. In computing the second line, the
components of b are 2.019, 4.052, 8.011, and 23.072. Succeeding lines use the proper
sums of values from the line above to determine the b-vector.

When the calculated values are compared to the analytical results, it is observed that
the errors of this method, in this example, are about the same as those made in the explicit
method with DΔt/(Δx)² = 1/4. We require only one-fourth as many time steps, however,
to reach 537.6 sec. With smaller values of Δx, this freedom of choice in the value of
DΔt/(Δx)² is especially valuable.
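The analytical values quoted in Table 8.4 can be cross-checked against a series obtained by separation of variables. The form used below is derived here as an assumption, and is not necessarily the form printed with Fig. 8.4: the solution is written as the steady state x/2 plus a sine series for the transient, whose coefficients come from expanding the initial deviation 2 − x/2 on 0 ≤ x ≤ 20.

```python
import math


def tube_series(x, t, D=0.119, terms=200):
    """Steady state x/2 plus a sine-series transient for the 20-cm tube example."""
    total = x / 2.0
    for n in range(1, terms + 1):
        # Fourier sine coefficient of the initial deviation 2 - x/2 on [0, 20]
        b_n = (4.0 + 16.0 * (-1) ** n) / (n * math.pi)
        decay = math.exp(-D * (n * math.pi / 20.0) ** 2 * t)
        total += b_n * math.sin(n * math.pi * x / 20.0) * decay
    return total


for t in (134.4, 268.8, 403.2):
    print(t, round(tube_series(4.0, t), 3), round(tube_series(12.0, t), 3))
```

The printed values agree with the analytical entries of Table 8.4 to within a unit or two in the third decimal place, which gives some confidence in both the table and the assumed series.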


8.3 DERIVATIVE BOUNDARY CONDITIONS


In heat-conduction problems, the most usual situation at the endpoints is not that they 
are held at a constant temperature, but that heat is lost by conduction or radiation at a 
rate proportional to some power of the temperature difference between the surface of the
body and its surroundings. This leads to a relationship involving the space derivative of 
temperature at the surface. In the analytical solution of heat-flow problems through Fourier 
series, this adds considerable complication in determining the coefficients, but our numer- 
ical technique requires only minor modifications. 
The rate of heat loss from the surface of a solid is generally expressed as 


Rate of heat loss = hA(u − u₀), (8.8)


where A is the surface area, u is the surface temperature, u₀ is the temperature of the
surrounding medium, and h is a coefficient of heat transfer; h is increased by motion of
the surrounding medium. To facilitate heat flow, the surrounding medium is often caused 
to flow rapidly by mechanical means (as in some salt evaporators) or by proper design 
(as in vertical-tube heat exchangers in distillation columns). This situation is called forced 
convection. 

In other situations, h is also a function of the surface temperature, as in natural 
convection, in which the motion of surrounding fluid is caused by thermal currents. 
Another important situation is heat loss by radiation, in which the rate of heat loss is
proportional to the fourth power of the temperature difference between the surface and
the surrounding surfaces to which heat is being radiated. In both these situations, the rate
of heat loss is proportional to some power of the surface temperature. This gives rise to
nonlinear equations that are not as readily solved. One usually prefers to force the
mathematical model to a linear form even though this is not exactly true. One way to
approximate this is to use Eq. (8.8), absorbing the nonlinear aspects into the coefficient
h, and to change h appropriately through the progress of the calculations so that it takes 
on a reasonably correct average value. Our examples will treat only the simpler situations 
where Eq. (8.8) applies directly. 


EXAMPLE

As we will see, heat loss from the surface by conduction leads to a derivative boundary
condition. Our procedure will be to replace the derivatives in both the differential equation
and in the boundary conditions by difference quotients. We illustrate with a simple
example.


An aluminum cube is 4 × 4 × 4 in. (k = 0.52, c = 0.226, ρ = 2.70 in cgs units). All
but one face is perfectly insulated, and the cube is initially at 1000°F. Heat is lost from
the uninsulated face to a fluid flowing past it according to the equation


$$\text{Rate of heat loss (in Btu per sec)} = hA(u - u_0),$$


where h = 0.15 Btu/(sec · ft² · °F),
      A = surface area in ft²,
      u = surface temperature, °F,
      $u_0$ = temperature of fluid, °F.


If $u_0$, the temperature of fluid flowing past the aluminum cube, is constant at 70°F, find
the temperatures inside the cube as a function of time. While we could work the problem
in cgs units, we elect to use English units, making suitable changes in k, c, and ρ.

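Because of the insulation on the lateral faces, again the only direction in which heat
will flow is perpendicular to the uninsulated face, and the equation is


$$\frac{\partial u}{\partial t} = \frac{k}{c\rho}\,\frac{\partial^2 u}{\partial x^2}.$$


The initial condition, with x representing distance from the uninsulated face, is

$$u(x, 0) = 1000.$$


For boundary conditions, at the open surface we have


$$-kA\,\frac{\partial u}{\partial x}\bigg|_{x=0} = -hA(u - 70),$$


because the rate at which heat leaves the surface must be equal to the rate at which heat
flows to the surface. The negative sign on the left is required because heat flows in a
direction opposite to a positive gradient; on the right it occurs because heat is being lost
in the direction of negative x. At the other side of the cube, the rate of heat flow is zero
because of insulation:

$$\frac{\partial u}{\partial x}\bigg|_{x=4} = 0.$$

We plan to use the explicit method with


$$r = \frac{k\,\Delta t}{c\rho(\Delta x)^2} = \frac14.^{*}$$


*The ratio must be smaller than $\frac12$ to give stability with the derivative end conditions.


Suppose we let $\Delta x = 1$ in. To calculate $\Delta t$, we need k, c, and ρ expressed in the
units of inches, pounds, seconds, and °F. We first convert units:


$$k = \left(0.52\,\frac{\text{cal}}{\text{sec}\cdot\text{cm}\cdot{}^\circ\text{C}}\right)\left(\frac{1\ \text{Btu}}{252\ \text{cal}}\right)\left(\frac{2.54\ \text{cm}}{1\ \text{in.}}\right)\left(\frac{1\,{}^\circ\text{C}}{1.8\,{}^\circ\text{F}}\right) = 0.00291\ \frac{\text{Btu}}{\text{sec}\cdot\text{in.}\cdot{}^\circ\text{F}},$$

$$c = \left(0.226\,\frac{\text{cal}}{\text{g}\cdot{}^\circ\text{C}}\right)\left(\frac{1\ \text{Btu}}{252\ \text{cal}}\right)\left(\frac{454\ \text{g}}{1\ \text{lb}}\right)\left(\frac{1\,{}^\circ\text{C}}{1.8\,{}^\circ\text{F}}\right) = 0.226\ \frac{\text{Btu}}{\text{lb}\cdot{}^\circ\text{F}},$$

$$\rho = \left(2.70\,\frac{\text{g}}{\text{cm}^3}\right)\left(\frac{1\ \text{lb}}{454\ \text{g}}\right)\left(\frac{2.54\ \text{cm}}{1\ \text{in.}}\right)^3 = 0.0975\ \frac{\text{lb}}{\text{in.}^3}.$$

Then

$$\Delta t = \frac{c\rho(\Delta x)^2}{4k} = \frac{(0.226)(0.0975)(1)^2}{(4)(0.00291)} = 1.89\ \text{sec}.$$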
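The unit conversions and the resulting time step can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names `to_english` and `time_step` are ours, and the factors 252 cal/Btu and 1.8°F per °C are the usual rounded values, assumed to be the ones behind the figures quoted in the text.

```python
# Rounded conversion factors (assumed to match those used in the text).
CAL_PER_BTU = 252.0       # 1 Btu is about 252 cal
CM_PER_IN = 2.54
G_PER_LB = 454.0
DEGF_PER_DEGC = 1.8       # a 1 deg C interval equals 1.8 deg F


def to_english(k_cgs, c_cgs, rho_cgs):
    """Convert k, c, rho from cgs units to Btu, in., lb, sec, deg F."""
    k = k_cgs / CAL_PER_BTU * CM_PER_IN / DEGF_PER_DEGC
    c = c_cgs / CAL_PER_BTU * G_PER_LB / DEGF_PER_DEGC
    rho = rho_cgs / G_PER_LB * CM_PER_IN ** 3
    return k, c, rho


def time_step(r, k, c, rho, dx):
    """Time step that gives the ratio r = k*dt / (c*rho*dx**2)."""
    return r * c * rho * dx ** 2 / k


k, c, rho = to_english(0.52, 0.226, 2.70)
dt = time_step(0.25, k, c, rho, dx=1.0)
print(f"k = {k:.5f}, c = {c:.3f}, rho = {rho:.4f}, dt = {dt:.2f} sec")
```

Carrying full precision through the conversions instead of the rounded intermediate values changes the time step only in the third decimal place.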


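For the ratio of values of $\Delta t$ and $\Delta x$ that we have chosen, our differential equation
becomes


$$u_i^{j+1} = \tfrac14\left(u_{i+1}^j + u_{i-1}^j\right) + \tfrac12 u_i^j, \tag{8.9}$$


which is to be applied at every point where u is unknown. In this example, this includes
the points at x = 0 and x = 4 as well as the interior points. To enable us to write Eq.
(8.9), we extend the domain of u one step on either side of the boundary. Let $x_R$ be a
fictitious point to the right of x = 4, and let $x_L$ be a fictitious point to the left of x = 0.
If $x_1$ signifies x = 0, and if $x_5$ signifies x = 4, we have the relations, from Eq. (8.9),


$$u_1^{j+1} = \tfrac14\left(u_L^j + u_2^j\right) + \tfrac12 u_1^j,$$

$$u_5^{j+1} = \tfrac14\left(u_4^j + u_R^j\right) + \tfrac12 u_5^j.$$


Now we use the boundary conditions to eliminate the fictitious points, writing them
as central-difference quotients:

$$-k\,\frac{\partial u}{\partial x}\bigg|_{x=0} \approx -(0.00291)\left(\frac{u_2^j - u_L^j}{2(1)}\right) = -\frac{0.15}{144}\left(u_1^j - 70\right);$$

$$-k\,\frac{\partial u}{\partial x}\bigg|_{x=4} \approx -(0.00291)\left(\frac{u_R^j - u_4^j}{2(1)}\right) = 0.$$


The 144 factor in the first equation changes h to a basis of in². Solving for $u_L$ and
$u_R$, we have


$$u_L^j = u_2^j - 0.716u_1^j + 50.1, \qquad u_R^j = u_4^j,$$


and the set of equations that give the temperatures becomes


$$\begin{aligned}
u_1^{j+1} &= 0.321u_1^j + \tfrac12 u_2^j + 12.525,\\
u_i^{j+1} &= \tfrac14\left(u_{i+1}^j + u_{i-1}^j\right) + \tfrac12 u_i^j, \qquad i = 2, 3, 4,\\
u_5^{j+1} &= \tfrac12 u_4^j + \tfrac12 u_5^j.
\end{aligned} \tag{8.10}$$


A similar treatment of the boundary conditions using the Crank-Nicolson method
with $r = k\,\Delta t/c\rho(\Delta x)^2 = 1$ leads to the set of simultaneous equations:


$$\begin{aligned}
4.716u_1^{j+1} - 2u_2^{j+1} &= -0.716u_1^j + 2u_2^j + 100.2,\\
-u_{i-1}^{j+1} + 4u_i^{j+1} - u_{i+1}^{j+1} &= u_{i-1}^j + u_{i+1}^j, \qquad i = 2, 3, 4,\\
-2u_4^{j+1} + 4u_5^{j+1} &= 2u_4^j.
\end{aligned} \tag{8.11}$$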
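The explicit relations for this example, including the elimination of the two fictitious points, can be sketched in a few lines. This is an illustration rather than one of the book's programs: the function name `explicit_step` and the symbol `beta` (the factor 2 Δx h/k, about 0.716 here) are our own labels.

```python
def explicit_step(u, r, beta, u_fluid):
    """Advance one time step; u[0] is the exposed face, u[-1] the insulated face."""
    n = len(u)
    new = [0.0] * n
    for i in range(n):
        if i == 0:
            # fictitious point to the left of the exposed face
            left = u[1] - beta * (u[0] - u_fluid)
        else:
            left = u[i - 1]
        if i == n - 1:
            # fictitious point to the right of the insulated face equals u[n-2]
            right = u[n - 2]
        else:
            right = u[i + 1]
        new[i] = r * (left + right) + (1.0 - 2.0 * r) * u[i]
    return new


k = 0.00291              # Btu/(sec*in.*deg F)
h = 0.15 / 144.0         # surface coefficient converted to a per-in.^2 basis
dx = 1.0
beta = 2.0 * dx * h / k  # about 0.716

u = [1000.0] * 5         # initial temperatures at x = 0, 1, 2, 3, 4 in.
u = explicit_step(u, 0.25, beta, 70.0)
print([round(v, 1) for v in u])
```

After one step only the surface value has changed, as expected: the interior points still see neighbors at 1000°F.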


For the set of equations in (8.11), $\Delta t$ will be 7.56 sec. We advance the solution one time
step at a time by repeatedly solving either (8.10) or (8.11), using the proper values of
$u^j$.

8.4 STABILITY AND CONVERGENCE CRITERIA


We have previously stated that in order to ensure stability and convergence in the explicit
method, the ratio $r = k\,\Delta t/c\rho(\Delta x)^2$ must be $\frac12$ or less. The implicit Crank-Nicolson
method has no such limitation. In this section, these phenomena and criteria will be
studied in more detail.

By convergence, we mean that the results of the method approach the analytical
values as $\Delta t$ and $\Delta x$ both approach zero. By stability, we mean that errors made at one
stage of the calculations do not cause increasingly large errors as the computations are
continued, but rather will eventually damp out.

We will first discuss convergence, limiting ourselves to the simple case of the
unsteady-state heat-flow equation in one dimension:*


$$\frac{\partial U}{\partial t} = \frac{k}{c\rho}\,\frac{\partial^2 U}{\partial x^2}. \tag{8.12}$$


Let us use the symbol U to represent the exact solution to Eq. (8.12), and u to represent
the numerical solution. At the moment we assume that u is free of round-off errors, so
the only difference between U and u is the error made by replacing Eq. (8.12) by the
difference equation. Let $e_i^j = U_i^j - u_i^j$, at the point $x = x_i$, $t = t_j$. By the explicit
method, Eq. (8.12) becomes


$$u_i^{j+1} = r\left(u_{i+1}^j + u_{i-1}^j\right) + (1 - 2r)u_i^j, \tag{8.13}$$

where $r = k\,\Delta t/c\rho(\Delta x)^2$. Substituting u = U − e into Eq. (8.13), we get

$$e_i^{j+1} = r\left(e_{i+1}^j + e_{i-1}^j\right) + (1 - 2r)e_i^j - r\left(U_{i+1}^j + U_{i-1}^j\right) - (1 - 2r)U_i^j + U_i^{j+1}. \tag{8.14}$$


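By using Taylor-series expansions, we have


$$U_{i+1}^j = U_i^j + \left(\frac{\partial U}{\partial x}\right)_{i,j}\Delta x + \frac{(\Delta x)^2}{2}\,\frac{\partial^2 U(\xi_1, t_j)}{\partial x^2}, \qquad x_i < \xi_1 < x_{i+1},$$

$$U_{i-1}^j = U_i^j - \left(\frac{\partial U}{\partial x}\right)_{i,j}\Delta x + \frac{(\Delta x)^2}{2}\,\frac{\partial^2 U(\xi_2, t_j)}{\partial x^2}, \qquad x_{i-1} < \xi_2 < x_i,$$

$$U_i^{j+1} = U_i^j + \Delta t\,\frac{\partial U(x_i, \eta)}{\partial t}, \qquad t_j < \eta < t_{j+1}.$$


*We could have treated the simpler equation $\partial U/\partial T = \partial^2 U/\partial X^2$ without loss of generality, since with the change
of variables $X = \sqrt{c\rho}\,x$, $T = kt$, the two equations are seen to be identical.

Substituting these into Eq. (8.14) and simplifying, remembering that $r(\Delta x)^2 = k\,\Delta t/c\rho$,
we get (the two second-derivative terms being combined into a single value at an intermediate point ξ)


$$e_i^{j+1} = r\left(e_{i+1}^j + e_{i-1}^j\right) + (1 - 2r)e_i^j + \Delta t\left[\frac{\partial U(x_i, \eta)}{\partial t} - \frac{k}{c\rho}\,\frac{\partial^2 U(\xi, t_j)}{\partial x^2}\right], \tag{8.15}$$

$$t_j \le \eta \le t_{j+1}, \qquad x_{i-1} \le \xi \le x_{i+1}.$$


Let $E^j$ be the magnitude of the maximum error in the row of calculations for $t = t_j$, and
let M > 0 be an upper bound for the magnitude of the expression in brackets in Eq.
(8.15). If $r \le \frac12$, all the coefficients in Eq. (8.15) are positive (or zero) and we may write
the inequality


$$\left|e_i^{j+1}\right| \le 2rE^j + (1 - 2r)E^j + M\,\Delta t = E^j + M\,\Delta t.$$

This is true for all the $e_i^{j+1}$ at $t = t_{j+1}$, so

$$E^{j+1} \le E^j + M\,\Delta t.$$

Since this is true at each time step,


$$E^{j+1} \le E^j + M\,\Delta t \le E^{j-1} + 2M\,\Delta t \le \cdots \le E^0 + (j + 1)M\,\Delta t = E^0 + Mt_{j+1} = Mt_{j+1},$$

because $E^0$, the errors at t = 0, are zero, since U is given by the initial conditions.


Now, as $\Delta x \to 0$, $\Delta t \to 0$ if $k\,\Delta t/c\rho(\Delta x)^2 \le \frac12$, and $M \to 0$, because, as both $\Delta x$
and $\Delta t$ get smaller,


$$\left[\frac{\partial U(x_i, \eta)}{\partial t} - \frac{k}{c\rho}\,\frac{\partial^2 U(\xi, t_j)}{\partial x^2}\right] \to \left(\frac{\partial U}{\partial t} - \frac{k}{c\rho}\,\frac{\partial^2 U}{\partial x^2}\right)_{i,j} = 0.$$


This last is by virtue of Eq. (8.12), of course. Consequently, we have shown that the
explicit method is convergent for $r \le \frac12$, because the errors approach zero as $\Delta t$ and $\Delta x$
are made smaller.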
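The convergence argument can be illustrated numerically. The following sketch is our own test case, not one from the text: it solves $\partial u/\partial t = \partial^2 u/\partial x^2$ on 0 ≤ x ≤ 1 with u = 0 at both ends and u(x, 0) = sin πx, for which the exact solution is $e^{-\pi^2 t}\sin \pi x$, and reports the largest error at t = 0.1 as the mesh is refined with r held at 0.4.

```python
import math


def explicit_max_error(n, r, t_final):
    """Largest error at the final time for the explicit method with n intervals."""
    dx = 1.0 / n
    dt = r * dx * dx
    steps = round(t_final / dt)
    u = [math.sin(math.pi * i * dx) for i in range(n + 1)]
    for _ in range(steps):
        interior = [r * (u[i + 1] + u[i - 1]) + (1.0 - 2.0 * r) * u[i]
                    for i in range(1, n)]
        u = [0.0] + interior + [0.0]      # fixed end values
    decay = math.exp(-math.pi ** 2 * steps * dt)
    return max(abs(u[i] - decay * math.sin(math.pi * i * dx))
               for i in range(n + 1))


for n in (10, 20, 40):
    print(n, explicit_max_error(n, r=0.4, t_final=0.1))
```

Each halving of Δx (with Δt reduced to keep r fixed) should cut the error by roughly a factor of four, consistent with errors that vanish as both increments approach zero.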

For the solution to the heat-flow equation by the Crank-Nicolson method, the analysis
of convergence may be made by similar methods. The treatment is more complicated,
but it can be shown that each $E^{j+1}$ is no greater than a finite multiple of $E^j$ plus a term
that vanishes as both $\Delta x$ and $\Delta t$ become small, and this is independent of r. Hence, since
the initial errors are zero, the finite-difference solution approaches the analytical solution
as $\Delta t \to 0$ and $\Delta x \to 0$, requiring only that r stay finite.

Let us begin our discussion of stability with a numerical example. Since the heat-flow
equation is linear, if two solutions are known, their sum is also a solution. We are
interested in what happens to errors made in one line of the computations as the calculations
are continued, and because of the additivity feature, the effect of a succession of errors
is just the sum of the effects of the individual errors. We follow, then, a single error,*
which most likely occurred due to round-off. If this single error does not grow in magnitude,
we will call the method stable, since then the cumulative effect of all errors affects
the later calculations no more than a linear combination of the previous errors would.

Table 8.5 illustrates the principle. We have calculated for the simple case where
the boundary conditions are fixed, so that the errors at the endpoints are zero. We
assume that a single error of size e occurs at $t = t_1$ and $x = x_2$. The explicit method,
with $k\,\Delta t/c\rho(\Delta x)^2 = \frac12$, was used. The original error quite obviously dies out. As an exercise,
it is left to the student to show that with r > 0.5, errors have an increasingly large
effect on later computations. Table 8.6 shows that errors damp out for the Crank-Nicolson
method with r = 1 even more rapidly than in the explicit method with r = 0.5.


Table 8.5  Propagation of errors: explicit method

         Endpoint                                  Endpoint
  t      x_1        x_2        x_3        x_4        x_5
  t_0    0          0          0          0          0
  t_1    0          e          0          0          0
  t_2    0          0          0.50e      0          0
  t_3    0          0.25e      0          0.25e      0
  t_4    0          0          0.25e      0          0
  t_5    0          0.125e     0          0.125e     0
  t_6    0          0          0.125e     0          0
  t_7    0          0.062e     0          0.062e     0
  t_8    0          0          0.062e     0          0


Table 8.6  Propagation of errors: Crank-Nicolson method

  t      x_1        x_2        x_3        x_4        x_5
  t_0    0          0          0          0          0
  t_1    0          e          0          0          0
  t_2    0          0.071e     0.286e     0.071e     0
  t_3    0          0.107e     0.143e     0.107e     0
  t_4    0          0.049e     0.092e     0.049e     0
  t_5    0          0.036e     0.053e     0.036e     0
  t_6    0          0.022e     0.033e     0.022e     0
  t_7    0          0.013e     0.020e     0.013e     0
  t_8    0          0.008e     0.013e     0.008e     0
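The explicit-method error table can be regenerated, and the exercise about r > 0.5 explored, with a short sketch. The function name `explicit_error_rows` is ours; the error size e is taken as 1, so each printed entry is a multiple of e.

```python
def explicit_error_rows(r, steps, n_points=5, start=1):
    """Rows of errors after a unit error at interior point `start`; ends fixed at 0."""
    e = [0.0] * n_points
    e[start] = 1.0
    rows = [e]
    for _ in range(steps):
        interior = [r * (e[i - 1] + e[i + 1]) + (1.0 - 2.0 * r) * e[i]
                    for i in range(1, n_points - 1)]
        e = [0.0] + interior + [0.0]
        rows.append(e)
    return rows


# r = 0.5: the rows for t_1 through t_8 of the explicit-method table
for label, row in enumerate(explicit_error_rows(0.5, 7), start=1):
    print(f"t_{label}", row)

# r = 0.6: the same single error eventually grows instead of dying out
growth = max(abs(v) for v in explicit_error_rows(0.6, 60)[-1])
print("largest error after 60 steps with r = 0.6:", growth)
```

With r = 0.6 the growth is slow at first, because only one component of the error is amplified (by about 5% per step), which is why a fairly long run is used to make it visible.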


*A computation made assuming that each of the interior points has an error equal to e at $t = t_1$ demonstrates
the effect more rapidly.

In order to discuss stability in a more analytical sense, we need some material from
linear algebra. In Chapter 6 we discussed eigenvalues and eigenvectors of a matrix. We
recall that for the matrix A and vector x, if


$$Ax = \lambda x,$$


then the scalar λ is an eigenvalue of A and x is the corresponding eigenvector. If the N
eigenvalues of the N × N matrix A are all different, then the corresponding N eigenvectors
are linearly independent, and any N-component vector can be written uniquely in terms
of them.

Consider the unsteady-state heat-flow problem with fixed boundary conditions. Suppose
we subdivide into N + 1 subintervals so there are N unknown values of the temperature
being calculated at each time step. Think of these N values as the components
of a vector. Our algorithm for the explicit method (Eq. (8.4)) can be written as the matrix
equation*


$$\begin{bmatrix} u_1^{j+1} \\ u_2^{j+1} \\ \vdots \\ u_N^{j+1} \end{bmatrix} =
\begin{bmatrix} (1 - 2r) & r & & \\ r & (1 - 2r) & r & \\ & \ddots & \ddots & \ddots \\ & & r & (1 - 2r) \end{bmatrix}
\begin{bmatrix} u_1^j \\ u_2^j \\ \vdots \\ u_N^j \end{bmatrix}, \tag{8.16}$$

or

$$u^{j+1} = Au^j,$$


where A represents the coefficient matrix and $u^j$ and $u^{j+1}$ are the vectors whose N
components are the successive calculated values of temperature. The components of $u^0$
are the initial values from which we begin our solution. The successive rows of our
calculations are


$$u^1 = Au^0, \qquad u^2 = Au^1 = A^2u^0, \qquad \ldots, \qquad u^j = Au^{j-1} = A^2u^{j-2} = \cdots = A^ju^0.$$

(The superscripts on the A are here exponents; on the vectors they indicate time.)


*A change of variable is required to give boundary conditions of u = 0 at each end. This can always be done
for fixed end conditions.

Suppose that errors are introduced into $u^0$, so that it becomes $\bar{u}^0$. We will follow
the effects of this error through the calculations. The successive lines of calculation are
now

$$\bar{u}^j = A\bar{u}^{j-1} = \cdots = A^j\bar{u}^0.$$


Let us define the vector $e^j$ as $u^j - \bar{u}^j$, so that $e^j$ represents the errors in $u^j$ caused
by the errors in $\bar{u}^0$. We have


$$e^j = u^j - \bar{u}^j = A^ju^0 - A^j\bar{u}^0 = A^je^0. \tag{8.17}$$


This shows that errors are propagated by using the same algorithm as that by which the
temperatures are calculated, as was implicitly assumed earlier in this section.
Now the N eigenvalues of A are distinct (see below) so that its N eigenvectors $x_1$,
$x_2, \ldots, x_N$ are independent, and

$$Ax_1 = \lambda_1x_1, \qquad Ax_2 = \lambda_2x_2, \qquad \ldots, \qquad Ax_N = \lambda_Nx_N.$$


We now write the error vector $e^0$ as a linear combination of the $x_i$:

$$e^0 = c_1x_1 + c_2x_2 + \cdots + c_Nx_N,$$

where the c's are constants. Then $e^1$ is, in terms of the $x_i$,

$$e^1 = Ae^0 = \sum_{i=1}^{N} Ac_ix_i = \sum_{i=1}^{N} c_iAx_i = \sum_{i=1}^{N} c_i\lambda_ix_i,$$


and for $e^2$,

$$e^2 = Ae^1 = \sum_{i=1}^{N} Ac_i\lambda_ix_i = \sum_{i=1}^{N} c_i\lambda_i^2x_i.$$


(Again, the superscripts on vectors indicate time; on λ they are exponents.) After j steps,
Eq. (8.17) can be written

$$e^j = \sum_{i=1}^{N} c_i\lambda_i^jx_i.$$

If the magnitudes of all of the eigenvalues are less than or equal to unity, errors will
not grow as the computations proceed; that is, the computational scheme is stable. This
then is the analytical condition for stability: that the largest eigenvalue of the coefficient
matrix for the algorithm be one or less in magnitude.
The eigenvalues of matrix A (Eq. (8.16)) can be shown to be (note they are all
distinct):


$$1 - 4r\sin^2\frac{n\pi}{2(N + 1)}, \qquad n = 1, 2, \ldots, N.$$

We will have stability for the explicit scheme if


$$-1 \le 1 - 4r\sin^2\frac{n\pi}{2(N + 1)} \le 1.$$


The limiting value of r is given by

$$-1 \le 1 - 4r\sin^2\frac{n\pi}{2(N + 1)},$$

$$r \le \frac{1/2}{\sin^2\left(n\pi/2(N + 1)\right)}.$$


Hence, if $r \le \frac12$, the explicit scheme is stable.
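The eigenvalue formula for the explicit coefficient matrix can be checked directly. The sketch below relies on a standard fact that the text does not state: the eigenvector belonging to the nth eigenvalue has components sin(inπ/(N + 1)), i = 1, ..., N. The function names are ours.

```python
import math


def explicit_eigenvalue(n, r, size):
    """nth eigenvalue of the explicit coefficient matrix with `size` unknowns."""
    return 1.0 - 4.0 * r * math.sin(n * math.pi / (2 * (size + 1))) ** 2


def residual(n, r, size):
    """Largest component of A x - lambda x for x_i = sin(i n pi / (size + 1))."""
    x = [0.0]
    x += [math.sin(i * n * math.pi / (size + 1)) for i in range(1, size + 1)]
    x += [0.0]                                   # zero padding past each end
    lam = explicit_eigenvalue(n, r, size)
    return max(abs(r * (x[i - 1] + x[i + 1]) + (1.0 - 2.0 * r) * x[i] - lam * x[i])
               for i in range(1, size + 1))


def spectral_radius(r, size):
    """Largest eigenvalue magnitude of the explicit coefficient matrix."""
    return max(abs(explicit_eigenvalue(n, r, size)) for n in range(1, size + 1))


print(max(residual(n, 0.5, 9) for n in range(1, 10)))   # round-off level
print(spectral_radius(0.5, 9), spectral_radius(0.6, 9))
```

For nine unknowns the largest magnitude stays below one when r = 0.5 and exceeds one when r = 0.6, in line with the stability limit just derived.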
The Crank-Nicolson scheme, in matrix form, is


$$\begin{bmatrix} (2 + 2r) & -r & & \\ -r & (2 + 2r) & -r & \\ & \ddots & \ddots & \ddots \\ & & -r & (2 + 2r) \end{bmatrix}
\begin{bmatrix} u_1^{j+1} \\ u_2^{j+1} \\ \vdots \\ u_N^{j+1} \end{bmatrix} =
\begin{bmatrix} (2 - 2r) & r & & \\ r & (2 - 2r) & r & \\ & \ddots & \ddots & \ddots \\ & & r & (2 - 2r) \end{bmatrix}
\begin{bmatrix} u_1^j \\ u_2^j \\ \vdots \\ u_N^j \end{bmatrix},$$


or

$$Au^{j+1} = Bu^j.$$


We can write


$$u^{j+1} = (A^{-1}B)u^j, \tag{8.18}$$


so that stability is given by the magnitudes of the eigenvalues of $A^{-1}B$. These are


$$\frac{2 - 4r\sin^2\left(n\pi/2(N + 1)\right)}{2 + 4r\sin^2\left(n\pi/2(N + 1)\right)}, \qquad n = 1, 2, \ldots, N.$$


Clearly all the eigenvalues are no greater than one in magnitude for any positive value
of r.

With derivative boundary conditions, a similar analysis shows that the Crank-Nicolson
method is stable for any positive value of r. For the explicit scheme, $r = \frac12$
leads to instability with a finite surface coefficient. Smith (1978) shows that the limitation
on r for stability is

$$r \le \frac{1}{2 + P\,\Delta x},$$

where P is the ratio of surface coefficient to conductivity, h/k.
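The claim that the Crank-Nicolson eigenvalues never exceed one in magnitude is easy to spot-check. The sketch assumes the sine argument nπ/2(N + 1), the same as for the explicit matrix with N unknowns; the function name `cn_eigenvalue` is ours.

```python
import math


def cn_eigenvalue(n, r, size):
    """nth eigenvalue of A^{-1}B for the Crank-Nicolson scheme with `size` unknowns."""
    s = 4.0 * r * math.sin(n * math.pi / (2 * (size + 1))) ** 2
    return (2.0 - s) / (2.0 + s)


for r in (0.5, 1.0, 10.0, 1000.0):
    worst = max(abs(cn_eigenvalue(n, r, 9)) for n in range(1, 10))
    print(r, worst)
```

For very large r the magnitudes crowd toward one from below, so errors are still damped but only slowly, which is one reason accuracy rather than stability limits the usable step.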


8.5 PARABOLIC EQUATIONS IN TWO OR MORE DIMENSIONS 


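In principle, we can readily extend the preceding methods to higher space dimensions,
especially when the region is rectangular. The heat-flow equation in two directions is


$$\frac{\partial u}{\partial t} = \frac{k}{c\rho}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right).$$


Taking $\Delta x = \Delta y$, and letting $r = k\,\Delta t/c\rho(\Delta x)^2$, we find that the explicit scheme becomes


$$u_{i,j}^{k+1} - u_{i,j}^k = r\left(u_{i+1,j}^k - 2u_{i,j}^k + u_{i-1,j}^k + u_{i,j+1}^k - 2u_{i,j}^k + u_{i,j-1}^k\right),$$

or

$$u_{i,j}^{k+1} = r\left(u_{i+1,j}^k + u_{i-1,j}^k + u_{i,j+1}^k + u_{i,j-1}^k\right) + (1 - 4r)u_{i,j}^k.$$


In this scheme, the maximum value permissible for r in the simple case of constant end
conditions is $\frac14$. (Note that this corresponds again to the numerical value that gives a
particularly simple formula.) In the more general case with $\Delta x \ne \Delta y$, the criterion is


$$\frac{k\,\Delta t}{c\rho\left[(\Delta x)^2 + (\Delta y)^2\right]} \le \frac18.$$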

The analogous equation in three dimensions, with equal grid spacing each way, has the
coefficient (1 − 6r), and $r \le \frac16$ is required for convergence and stability.

The difficulty with the use of the explicit scheme is that the restrictions on $\Delta t$ require
inordinately many rows of calculations. One then looks for a method in which $\Delta t$ can be
made larger without loss of stability. In one dimension, the Crank-Nicolson method was
such a method. In the two-dimensional case, using averages of central-difference approximations
to give $\partial^2 u/\partial x^2$ and $\partial^2 u/\partial y^2$ at the midvalue of time, we get


$$\begin{aligned}
u_{i,j}^{k+1} - u_{i,j}^k = \frac{r}{2}\Big[\, & u_{i+1,j}^{k+1} - 2u_{i,j}^{k+1} + u_{i-1,j}^{k+1} + u_{i+1,j}^k - 2u_{i,j}^k + u_{i-1,j}^k \\
 & + u_{i,j+1}^{k+1} - 2u_{i,j}^{k+1} + u_{i,j-1}^{k+1} + u_{i,j+1}^k - 2u_{i,j}^k + u_{i,j-1}^k \Big].
\end{aligned}$$


The problem now is that a set of (M)(N) simultaneous equations must be solved at
each time step, where M is the number of unknown values in the x-direction and N in
the y-direction. Furthermore, the coefficient matrix is no longer tridiagonal, so the solution
to each set of equations is slower and memory space to store the elements of the matrix
becomes exorbitant.

The advantage of a tridiagonal matrix is retained in the alternating-direction-implicit
scheme (A.D.I.) proposed by Peaceman and Rachford (1955). It is widely used in modern
computer programs for the solution of parabolic partial-differential equations. In this
method, we approximate $\nabla^2 u$ by adding a central-difference approximation to $\partial^2 u/\partial x^2$
written at the beginning of the interval to a similar expression for $\partial^2 u/\partial y^2$ written at the
end:

$$u_{i,j}^{k+1} - u_{i,j}^k = r\Big[\underbrace{u_{i+1,j}^k - 2u_{i,j}^k + u_{i-1,j}^k}_{\text{from } \partial^2 u/\partial x^2 \text{ at start}} + \underbrace{u_{i,j+1}^{k+1} - 2u_{i,j}^{k+1} + u_{i,j-1}^{k+1}}_{\text{from } \partial^2 u/\partial y^2 \text{ at end}}\Big]. \tag{8.19}$$


The obvious bias in this formula is balanced by reversing the order for the second-derivative
approximations in the next time span:


$$u_{i,j}^{k+2} - u_{i,j}^{k+1} = r\Big[\underbrace{u_{i+1,j}^{k+2} - 2u_{i,j}^{k+2} + u_{i-1,j}^{k+2}}_{\text{from } \partial^2 u/\partial x^2 \text{ at end}} + \underbrace{u_{i,j+1}^{k+1} - 2u_{i,j}^{k+1} + u_{i,j-1}^{k+1}}_{\text{from } \partial^2 u/\partial y^2 \text{ at start}}\Big]. \tag{8.20}$$


We illustrate the A.D.I. method with a very simple example.
EXAMPLE  A square plate of steel is 15 cm on a side. Initially, all points are at 0°C. Follow the
interior temperatures if two adjacent sides are suddenly brought to 100°C and held at that
temperature. The plate is insulated on its flat surfaces, so that heat flows only in the
x- and y-directions. We are given that k = 0.13, ρ = 7.8, c = 0.11 in cgs units. Take
$\Delta x = \Delta y = 5$ cm, so that there are four interior points. Label the points as in
Fig. 8.6, with u being used for horizontal traverses, v for vertical traverses.
Equations (8.19) and (8.20) become, with $r = k\,\Delta t/c\rho(\Delta x)^2$,

Table 8.7  Temperature changes in a 15 × 15 cm steel plate, computed by
the A.D.I. method

Temperatures at
2 1s. 37568 6138 11s 37788
132.0 959 64.017 15.901 333 
_ 1485 41.758 66159 17.410 41.311 
165.0 43.278 <7 ase 18683 3.278 
11 s* 26.485 @25 19.776 44.512 
198.0 45.500 70.318 2683 45.500 
- 245° 46314 71.197 21.490 632! 
210 88 71.907 Doe pees 
275" 46533 72.482 2589 os 
268.0 me 72988 B09 a7 98 
= 50.00 0 300 50.00 


Figure 8.6 


Figure 8.7

(Note the similarity to the equations for the A.D.I. method for steady-state temperatures
in Section 7.11. This is hardly surprising, since the temperatures for the unsteady-state
problem will eventually reach the steady state. One can think of this as the rationale
behind the A.D.I. method for elliptic equations.)

We solve for the temperature history by using the above equations in succession.
One normally discards the alternate computations, beginning with the first, because they
tend to be inaccurate.
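The alternating traverses for the steel-plate example can be sketched as follows. This is an illustration, not the book's Program 3. It assumes that the two sides not brought to 100°C remain at 0°C, which the example does not state explicitly, and the names `thomas` and `adi_traverse` are ours. Each traverse solves one small tridiagonal system per grid line.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system; a, b, c are the sub-, main, and super-diagonals."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = dp[:]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def adi_traverse(u, r, implicit_in_y):
    """One traverse on a square grid u[i][j] (i along x, j along y); edges held fixed."""
    n = len(u) - 1
    m = n - 1                                   # unknowns on each grid line
    new = [row[:] for row in u]
    for line in range(1, n):
        rhs = []
        for p in range(1, n):
            if implicit_in_y:                   # explicit x-differences, implicit y
                i, j = line, p
                d2 = u[i + 1][j] - 2.0 * u[i][j] + u[i - 1][j]
                lo, hi = u[i][0], u[i][n]
            else:                               # explicit y-differences, implicit x
                i, j = p, line
                d2 = u[i][j + 1] - 2.0 * u[i][j] + u[i][j - 1]
                lo, hi = u[0][j], u[n][j]
            value = u[i][j] + r * d2
            if p == 1:
                value += r * lo                 # known edge value moved to the right side
            if p == n - 1:
                value += r * hi
            rhs.append(value)
        line_values = thomas([-r] * m, [1.0 + 2.0 * r] * m, [-r] * m, rhs)
        for p in range(1, n):
            if implicit_in_y:
                new[line][p] = line_values[p - 1]
            else:
                new[p][line] = line_values[p - 1]
    return new


n = 3                                           # 15 cm side, 5 cm spacing
plate = [[0.0] * (n + 1) for _ in range(n + 1)]
for s in range(n + 1):
    plate[0][s] = 100.0                         # side x = 0 held at 100
    plate[s][0] = 100.0                         # side y = 0 held at 100
for _ in range(200):                            # 200 pairs of traverses, r = 0.1
    plate = adi_traverse(plate, 0.1, implicit_in_y=True)
    plate = adi_traverse(plate, 0.1, implicit_in_y=False)
print([[round(plate[i][j], 2) for j in (1, 2)] for i in (1, 2)])
```

Run long enough, the four interior values settle at the steady-state temperatures 75, 50, 50, and 25°C, which is the connection to the elliptic problem noted in the text.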


Table 8.7 and Fig. 8.7 show the results, with $\Delta t = 16.5$ sec, corresponding to r = 0.1.

[Figure 8.7: plot of temperature, °C]

The compensation of errors produced by this alternation of direction gives a method
that is convergent and stable for all values of r, although accuracy requires that r not
be too large. The three-dimensional analog alternates three ways, returning to each of the
three formulas after every third step. (Unfortunately the three-dimensional case is not
stable for all fixed values of r > 0. A variant due to Douglas (1962) is unconditionally
stable, however.) When the formulas are rearranged, in each case tridiagonal coefficient
matrices result.

Note that the equations can be broken up into two independent subsets, each containing
only two equations. This is always true in the A.D.I. method; each row gives a
set independent of the equations from the other rows. For columns, the same thing occurs.
For very large problems, this is important, because it permits the ready overlay of main
memory in solving the independent sets.

When the region in which the heat-flow equation is to be satisfied is not rectangular,
one may perturb the boundary to make it agree with a square mesh or interpolate from
boundary points to estimate u at adjacent mesh points as discussed in Chapter 7 for elliptic
equations.

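The frequency with which circular or spherical regions occur makes it worthwhile
to mention the heat equation in polar and spherical coordinates. The basic equation
becomes, in polar coordinates $(r, \theta)$,


$$\frac{\partial u}{\partial t} = \frac{k}{c\rho}\left(\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\,\frac{\partial u}{\partial r} + \frac{1}{r^2}\,\frac{\partial^2 u}{\partial \theta^2}\right),$$

and, in spherical coordinates $(r, \theta, \phi)$,

$$\frac{\partial u}{\partial t} = \frac{k}{c\rho}\left(\frac{\partial^2 u}{\partial r^2} + \frac{2}{r}\,\frac{\partial u}{\partial r} + \frac{1}{r^2}\,\frac{\partial^2 u}{\partial \theta^2} + \frac{\cot\theta}{r^2}\,\frac{\partial u}{\partial \theta} + \frac{1}{r^2\sin^2\theta}\,\frac{\partial^2 u}{\partial \phi^2}\right).$$


Using finite-difference approximations to convert these to difference equations is straightforward
except at the origin where r = 0. For this point, consider $\nabla^2 u$ in rectangular
coordinates, so that, in two dimensions,


$$\nabla^2 u \approx \frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j}}{(\Delta r)^2}.$$

This is exactly the same as the expression for the Laplacian in Chapter 7, Eq. (7.6a).
This expression for $\nabla^2 u$ is obviously independent of the orientation of the axes. We get
the best value by using the average value, $u_{\text{av}}$, of all points that are a distance $\Delta r$ from the
origin, where the value is $u_0$, so that


$$\nabla^2 u = \frac{4(u_{\text{av}} - u_0)}{(\Delta r)^2} \qquad \text{at } r = 0.$$


The corresponding relation for spherical coordinates is


$$\nabla^2 u = \frac{6(u_{\text{av}} - u_0)}{(\Delta r)^2} \qquad \text{at } r = 0.$$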

FINITE ELEMENTS FOR HEAT FLOW 


In Chapters 6 and 7 we observed that the finite-element method often is preferred for
obtaining approximate solutions to differential equations whose conditions are specified
on the boundaries of a region. It is also an important alternative to finite differences
in solving parabolic equations. You should have some exposure to this application of
finite elements, but we do not have space enough to give a full treatment.

Consider first the one-dimensional problem that we discussed in Sections 8.1-8.3.
The equation for this is


$$\frac{\partial u}{\partial t} = \frac{k}{c\rho}\,\frac{\partial^2 u}{\partial x^2} \quad \text{over } [x_a, x_b], \tag{8.21}$$


subject to initial conditions that give u(x, 0) and boundary conditions that give $u(x_a, t)$
and $u(x_b, t)$. It is possible to consider this a two-dimensional problem (in x and t) and
treat it as in Section 7.12. However, it is customary to apply finite elements only to the
x-domain and approximate the time derivative by finite differences. We will adopt this
approach, using a forward-difference approximation, so Eq. (8.21) becomes


$$\frac{u^{m+1} - u^m}{\Delta t} = \frac{k}{c\rho}\,\frac{\partial^2 u^m}{\partial x^2}. \tag{8.22}$$

Superscripts on u indicate time; subscripts on u in the following indicate position.

We break the region of the problem into discrete contiguous subintervals and approximate
u within element (e), whose endpoints are $x_e$ and $x_{e+1}$, as a linear interpolation of
$u_e$ and $u_{e+1}$. In other words,


$$u^{(e)} = N_eu_e + N_{e+1}u_{e+1} \quad \text{within } (e).$$

This is exactly what we did in Section 6.4. Of course in our present application, all of
the u-values are functions of t. We assume that t has some specified value, say $t_m$, for
the present. As explained in Chapter 6,


$$N_e = \frac{x_{e+1} - x}{h}, \qquad N_{e+1} = \frac{x - x_e}{h}.$$

These shape functions are identically zero outside element (e). Clearly $N_e(x_e) = 1$,
$N_{e+1}(x_{e+1}) = 1$, $N_e(x_{e+1}) = 0$, and $N_{e+1}(x_e) = 0$. Also, $dN_e/dx = -1/h$ and
$dN_{e+1}/dx = 1/h$ (where $h = x_{e+1} - x_e$) within (e) and both derivatives are zero outside
of (e).

To find an approximation to u at the nodes, the finite-element method evaluates a
set of integrals to get the coefficients of a set of linear equations whose variables are the
unknown nodal values. For Eq. (8.22), the integral for node i in element (e) at time $t_m$
is (this is usually obtained by the Galerkin procedure):


$$\int^{(e)} N_i\,\frac{u_i^{m+1} - u_i^m}{\Delta t}\,dx + \frac{k}{c\rho}\int^{(e)}\left(u_i^m\,\frac{dN_i}{dx}\,\frac{dN_i}{dx} + u_j^m\,\frac{dN_i}{dx}\,\frac{dN_j}{dx}\right)dx. \tag{8.23}$$


In Eq. (8.23), j refers to the other node of element (e). (If there is a derivative boundary
condition or if there are heat sources or sinks within (e), this nodal integral is more
complicated.)
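Since our shape functions are linear, the integrals are easy to evaluate:


$$\int^{(e)} N_i\,dx = \frac{h^{(e)}}{2}, \qquad h^{(e)} = \text{width of element},$$

$$\int^{(e)}\left(\frac{dN_i}{dx}\right)\left(\frac{dN_j}{dx}\right)dx = \begin{cases} 1/h^{(e)} & \text{if } i = j, \\ -1/h^{(e)} & \text{if } i \ne j. \end{cases}$$

Consider two adjoining elements, (e) and (e + 1), that are joined at node n. The
coefficients of the system are found by adding the nodal integrals for node n from both
elements:


$$\begin{aligned}
& \int^{(e)} N_n\,\frac{u_n^{m+1} - u_n^m}{\Delta t}\,dx + \frac{k}{c\rho}\int^{(e)}\left(u_{n-1}^m\,\frac{dN_n}{dx}\,\frac{dN_{n-1}}{dx} + u_n^m\,\frac{dN_n}{dx}\,\frac{dN_n}{dx}\right)dx \\
& \quad + \int^{(e+1)} N_n\,\frac{u_n^{m+1} - u_n^m}{\Delta t}\,dx + \frac{k}{c\rho}\int^{(e+1)}\left(u_n^m\,\frac{dN_n}{dx}\,\frac{dN_n}{dx} + u_{n+1}^m\,\frac{dN_n}{dx}\,\frac{dN_{n+1}}{dx}\right)dx = 0.
\end{aligned}$$

Evaluating the integrals and assuming the width of both elements is h, we get

$$\frac{u_n^{m+1} - u_n^m}{\Delta t}\left(\frac{h}{2}\right) + \frac{k}{c\rho}\left[u_{n-1}^m\left(-\frac{1}{h}\right) + u_n^m\left(\frac{1}{h}\right)\right] + \frac{u_n^{m+1} - u_n^m}{\Delta t}\left(\frac{h}{2}\right) + \frac{k}{c\rho}\left[u_n^m\left(\frac{1}{h}\right) + u_{n+1}^m\left(-\frac{1}{h}\right)\right] = 0.$$


Collecting terms, putting those at the new time $t_{m+1}$ on the left and those at $t_m$ on the
right, gives


$$u_n^{m+1} = u_n^m + \frac{k\,\Delta t}{c\rho h^2}\left(u_{n+1}^m - 2u_n^m + u_{n-1}^m\right).$$


This is exactly the same as Eq. (8.4), so we see that, if equal-sized elements are used
and if the time derivative is approximated by a forward difference, the one-dimensional
finite-element procedure reduces to the explicit finite-difference method. If different
approximations are used for the time derivative, we can get implicit methods such as
Crank-Nicolson. Of course, one often varies the size of elements to get greater precision
in some subintervals. This is where finite-element formulations shine.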
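The equivalence claimed here can be checked by assembling the nodal integrals element by element and comparing the result with the explicit finite-difference update. The sketch below assumes equal linear elements and fixed end values; the names `fe_update` and `fd_update`, and the accumulator names `capacity` and `flux`, are ours.

```python
def fe_update(u, k_over_c_rho, dt, h):
    """Assemble the nodal integrals over equal linear elements and advance one step."""
    n = len(u)
    capacity = [0.0] * n     # sum over elements of the integral of N_i (h/2 each)
    flux = [0.0] * n         # sum of (k / c rho) * integral of (dN_i/dx)(du/dx)
    for e in range(n - 1):   # element e joins nodes e and e + 1
        slope = (u[e + 1] - u[e]) / h            # du/dx is constant in the element
        capacity[e] += h / 2.0
        capacity[e + 1] += h / 2.0
        flux[e] += k_over_c_rho * (-1.0 / h) * slope * h
        flux[e + 1] += k_over_c_rho * (1.0 / h) * slope * h
    new = u[:]
    for i in range(1, n - 1):                    # end values are held fixed
        new[i] = u[i] - dt * flux[i] / capacity[i]
    return new


def fd_update(u, r):
    """Explicit finite-difference step with the same fixed end values."""
    interior = [r * (u[i + 1] + u[i - 1]) + (1.0 - 2.0 * r) * u[i]
                for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


u = [0.0, 1.0, 4.0, 9.0, 16.0]
fe = fe_update(u, k_over_c_rho=2.0, dt=0.02, h=0.5)
fd = fd_update(u, r=2.0 * 0.02 / 0.5 ** 2)
print(fe)
print(fd)
```

The assembly loop is written per element so that unequal element widths could be handled by replacing the single `h` with a list of widths; in that case the result no longer matches the equal-spacing difference formula.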

We have applied the finite-element method to only two contiguous elements. Extending
to many simplex elements is easy. When we reach a boundary of the region, the
value of u will be known, or is specified in terms of space derivatives. (Remember that
derivative conditions may add terms to the element equations.) Of course, we start off
in the usual manner, employing the known values of u given by the initial conditions.
There is a limit to the size of $\Delta t$ that can be used without instability, exactly as with finite
differences.

For heat flow in two space dimensions, we can parallel the above development. Eq.
(8.23) becomes, for node i in triangular element (e) at time $t_m$, and using a forward-difference
approximation for $\partial u/\partial t$:


$$\begin{aligned}
& \iint^{(e)} N_i\,\frac{u_i^{m+1} - u_i^m}{\Delta t}\,dx\,dy \\
& \quad + \frac{k}{c\rho}\iint^{(e)}\left(u_i^m\,\frac{\partial N_i}{\partial x}\,\frac{\partial N_i}{\partial x} + u_j^m\,\frac{\partial N_i}{\partial x}\,\frac{\partial N_j}{\partial x} + u_k^m\,\frac{\partial N_i}{\partial x}\,\frac{\partial N_k}{\partial x}\right)dx\,dy \\
& \quad + \frac{k}{c\rho}\iint^{(e)}\left(u_i^m\,\frac{\partial N_i}{\partial y}\,\frac{\partial N_i}{\partial y} + u_j^m\,\frac{\partial N_i}{\partial y}\,\frac{\partial N_j}{\partial y} + u_k^m\,\frac{\partial N_i}{\partial y}\,\frac{\partial N_k}{\partial y}\right)dx\,dy.
\end{aligned} \tag{8.25}$$


In Eq. (8.25), we assume that element (e) has nodes i, j, k. There will be additional
contributions from integrals of similar form to the coefficients of $u_i$ and the right-hand
side of row i from all other elements that share node i. Row i will also have terms for
the columns that correspond to the other vertices of elements that join at node i.

Again, because we have used shape functions that are linear in x and y, the integrals
are easy to evaluate:


$$\iint^{(e)} N_i\,dx\,dy = A^{(e)}/3, \quad \text{where } A^{(e)} \text{ is the area of } (e),$$

$$\iint^{(e)} \frac{\partial N_i}{\partial x}\,\frac{\partial N_j}{\partial x}\,dx\,dy = A^{(e)}\,\frac{\partial N_i}{\partial x}\,\frac{\partial N_j}{\partial x},$$

$$\iint^{(e)} \frac{\partial N_i}{\partial y}\,\frac{\partial N_j}{\partial y}\,dx\,dy = A^{(e)}\,\frac{\partial N_i}{\partial y}\,\frac{\partial N_j}{\partial y},$$

since the derivatives of the linear shape functions are constant within (e).

The extension of the finite-element method to time-dependent heat flow in a two-dimensional
region using triangular elements, while straightforward, is computationally
tedious. Setting up the matrices when there are many elements and many nodes really
requires a computer.

Equation (8.25), as written, gives an explicit formula. Modifying the way the time
derivative is approximated may lead to more accurate implicit methods that are less
stringent on the size of $\Delta t$.


CHAPTER SUMMARY 


Test your understanding of this chapter by asking yourself if you can


1. Solve a parabolic partial-differential equation in one space dimension by both the
explicit and implicit (Crank-Nicolson) methods with boundary conditions that may
involve derivatives.

2. Explain the differences between explicit and implicit methods, giving the advantages
of each.

3. Outline the arguments that show if a method is stable and convergent.

4. Discuss how finite-difference methods can be applied to parabolic equations in two
and three space dimensions, describing the problems that are involved.

5. Set up the equations for the A.D.I. method in a two-dimensional problem, and employ
a computer program for this on a rectangular region. You can outline how the method
might be applied to an irregular region.

6. Explain how finite elements can be applied to parabolic equations and solve simpler
situations with this method. You should be able to tell when this method reduces to
the same equations as with a finite-difference method.

7. Use, modify, and critique computer programs that solve parabolic equations.


COMPUTER PROGRAMS 


Three computer programs are presented as examples of how parabolic partial-differential
equations in one or two space dimensions can be solved on a computer. Program 1 (Fig.
8.8) uses the explicit method of Section 8.1; Program 2 (Fig. 8.9) employs the implicit
Crank-Nicolson method of Section 8.2; Program 3 (Fig. 8.10) solves the unsteady-state
heat-flow equation in two space dimensions by the A.D.I. method. In the first program,
both ends are held at a known temperature, and these end temperatures are allowed to
vary with time. In the second, however, derivative end conditions are permitted.

For Program 1, the fundamental difference equation that approximates the differential
equation is Eq. (8.4):


$$u_i^{j+1} = r\left(u_{i+1}^j + u_{i-1}^j\right) + (1 - 2r)u_i^j, \qquad r = \frac{k\,\Delta t}{c\rho(\Delta x)^2}.$$


This relation is applied at each interior point. The end conditions, u(0, t) and u(L, t),
are defined by arithmetic statement functions. At the end of each time step, the interior
temperatures are computed and a line of temperatures is printed.

The program is tested by computing the temperatures in a 10-cm-long bar with
k = 0.53, c = 0.226, ρ = 2.70, Δx = 1.0, r = 0.25 (all in cgs units). The bar is initially
at 20°C at all points, and temperatures are caused to change by suddenly cooling
one end to 0°C and heating the other end to 100°C. The program computes values until
T > 20, but only part of the output is shown.
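For readers who want to experiment without a Fortran compiler, the test case just described can be mirrored in a few lines of Python. This is a sketch, not a translation of Program 1: the function name `explicit_bar` is ours, and the end temperatures are taken as constants rather than functions of time.

```python
def explicit_bar(k, c, rho, length, n, ratio, t_final, left, right, u0):
    """Explicit method for a bar with fixed end temperatures; runs until t > t_final."""
    dx = length / n
    dt = ratio * c * rho * dx * dx / k
    u = [u0] * (n + 1)
    u[0], u[n] = left, right
    t = 0.0
    while t <= t_final:
        interior = [ratio * (u[i + 1] + u[i - 1]) + (1.0 - 2.0 * ratio) * u[i]
                    for i in range(1, n)]
        u = [left] + interior + [right]
        t += dt
    return t, dt, u


t, dt, u = explicit_bar(k=0.53, c=0.226, rho=2.70, length=10.0, n=10,
                        ratio=0.25, t_final=20.0, left=0.0, right=100.0, u0=20.0)
print(f"dt = {dt:.4f} sec, stopped at t = {t:.2f} sec")
print([round(v, 2) for v in u])
```

With these data the time step works out to about 0.288 sec, so roughly 70 rows are computed before the time exceeds 20 sec.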

Program 2, using the Crank-Nicolson method, is more elaborate in that derivative
end conditions are permitted, in the form


$$au + b\,\frac{\partial u}{\partial x} = c,$$


where u is the temperature at the end of the bar. This equation can simulate the loss of
heat by convection or by imperfect insulation at each end. The program generates the
coefficients of the set of simultaneous equations that are to be solved, which are, when
R = 1,

At left end:      $(4 - 2\alpha\,\Delta x)u_1^{j+1} - 2u_2^{j+1} = 2\alpha\,\Delta x\,u_1^j + 2u_2^j - 4\nu\,\Delta x$;

Interior points:  $-u_{i-1}^{j+1} + 4u_i^{j+1} - u_{i+1}^{j+1} = u_{i-1}^j + u_{i+1}^j$;

At right end:     $-2u_n^{j+1} + (4 + 2\alpha\,\Delta x)u_{n+1}^{j+1} = 2u_n^j - 2\alpha\,\Delta x\,u_{n+1}^j + 4\nu\,\Delta x$.


In these relations, α = −a/b, ν = c/b. The coefficients are compressed into an
n × 4 array and the tridiagonal system is solved by first finding the LU decomposition.
Temperatures are printed after each time step. The program calculates temperatures
until T = 1000 for a bar of length 20 cm whose initial temperatures are given by
u = 100 − 10|x − 10|, with the left end condition of $u_x = 0.2(u - 15)$ and right end condition
of u = 100. Partial output is shown.

Program 3 illustrates how the A.D.I. method can be implemented with a computer 
program. Two vectors are used, U and V, to hold the function values as calculated by 
the alternating horizontal and vertical traverses. The program sets up the coefficient 
matrices for each traverse, computes the right-hand sides from the last computed values 
of U and V, and solves the system using LU decomposition. Function values are output 
only after each second set of computations because the odd traverses are less accurate. 
The output for this program is for a rectangular region with 105 interior nodes when the 
boundary is held at 0 on three sides and at 100 on the fourth side. Initially, the potential 
function is 0 at all points. 
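
The alternating traverses can be sketched compactly in Python. This is an illustration only, not the FORTRAN of Program 3: it keeps the temperatures in a two-dimensional list instead of the two differently ordered vectors U and V, and the names `thomas`, `at`, `adi_double_sweep`, and `laplace_residual` are assumptions made for the sketch. Each traverse solves tridiagonal systems with diagonal 1/r + 2 and off-diagonals −1, with right-hand sides built from the other direction.

```python
def thomas(sub, diag, sup, rhs):
    """Solve a tridiagonal system (sub[0] and sup[-1] are never used)."""
    n = len(diag)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = sup[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        pivot = diag[i] - sub[i] * cp[i - 1]
        cp[i] = sup[i] / pivot
        dp[i] = (rhs[i] - sub[i] * dp[i - 1]) / pivot
    x = dp[:]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def at(grid, i, j, edges):
    """Value at row i, column j; outside the grid, the fixed edge value."""
    top, bottom, left, right = edges
    if i < 0:
        return top
    if i >= len(grid):
        return bottom
    if j < 0:
        return left
    if j >= len(grid[0]):
        return right
    return grid[i][j]


def adi_double_sweep(u, r, edges):
    """One horizontal traverse followed by one vertical traverse."""
    m, n = len(u), len(u[0])
    top, bottom, left, right = edges
    lo, hi = 1.0 / r - 2.0, 1.0 / r + 2.0
    half = []
    for i in range(m):      # implicit along each row, explicit across rows
        rhs = [at(u, i - 1, j, edges) + lo * u[i][j] + at(u, i + 1, j, edges)
               for j in range(n)]
        rhs[0] += left
        rhs[-1] += right
        half.append(thomas([-1.0] * n, [hi] * n, [-1.0] * n, rhs))
    new = [[0.0] * n for _ in range(m)]
    for j in range(n):      # implicit along each column, explicit across columns
        rhs = [at(half, i, j - 1, edges) + lo * half[i][j] + at(half, i, j + 1, edges)
               for i in range(m)]
        rhs[0] += top
        rhs[-1] += bottom
        col = thomas([-1.0] * m, [hi] * m, [-1.0] * m, rhs)
        for i in range(m):
            new[i][j] = col[i]
    return new


def laplace_residual(u, edges):
    """Largest violation of the steady-state five-point relation."""
    return max(abs(4.0 * u[i][j] - at(u, i - 1, j, edges) - at(u, i + 1, j, edges)
                   - at(u, i, j - 1, edges) - at(u, i, j + 1, edges))
               for i in range(len(u)) for j in range(len(u[0])))


# 7 x 15 interior nodes, r = 0.5, one side held at 100, the others at 0.
edges = (0.0, 0.0, 0.0, 100.0)      # top, bottom, left, right
u = [[0.0] * 15 for _ in range(7)]
for _ in range(300):
    u = adi_double_sweep(u, 0.5, edges)
print(laplace_residual(u, edges))
```

After many double traverses the values stop changing, and the residual of the steady-state five-point relation gives a simple check that both traverses and the boundary terms are consistent.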


Figure 8.8 Program 1.

      PROGRAM EXPLPA (INPUT, OUTPUT)
C
C     THIS PROGRAM COMPUTES ONE DIMENSIONAL UNSTEADY STATE POTENTIAL
C     FLOW PROBLEMS BY THE EXPLICIT METHOD. THE POTENTIAL AT EACH END
C     IS SPECIFIED AS A FUNCTION OF TIME THROUGH ARITHMETIC STATEMENT
C     FUNCTIONS.
C
C     PARAMETERS ARE :
C
C     U,V    - VALUES OF THE POTENTIAL AT NODES
C     T      - TIME
C     TF     - FINAL TIME FOR WHICH VALUES ARE COMPUTED
C     DT     - DELTA T
C     LEN    - LENGTH
C     DX     - DELTA X
C     N      - NUMBER OF X INTERVALS
C     K      - CONDUCTIVITY
C     C      - HEAT CAPACITY
C     RHO    - DENSITY
C     RATIO  - RATIO OF K(DT)/C(RHO)(DX)**2
C     FLFT   - BOUNDARY CONDITION ON LEFT END
C     FRT    - BOUNDARY CONDITION ON RIGHT END
C
      REAL U(500),V(500),LEN,K,T,TF,DT,DX,C,RHO,THALF,RATIO
      INTEGER N,NP1,I
C
C     WE DEFINE SOME CONSTANTS WITH A DATA STATEMENT
C
      DATA T,TF,LEN,N,K,C,RHO,RATIO/0.0,20.0,10.0,10,0.53,0.226,
     +     2.70,0.25/
C
C     READ IN THE INITIAL TEMPERATURES AND WRITE THEM OUT
C
      NP1 = N + 1
      DX = LEN / N
      DT = RATIO*C*RHO*DX*DX / K
      READ *, ( U(I), I = 1,NP1 )
      PRINT 201, T,LEN,DX,T, ( U(I), I = 1,NP1 )
C
C     WE GET THE POTENTIAL FROM THE EXPLICIT RELATION, PRINTING EACH
C     SET OF VALUES AS THEY ARE COMPUTED. TWO ARRAYS ARE USED,
C     HOLDING ALTERNATE SETS OF VALUES.
C
   10 THALF = T + DT/2.0
      U(1) = FLFT(THALF)
      U(N+1) = FRT(THALF)
      DO 20 I = 2,N
         V(I) = RATIO * ( U(I+1) + U(I-1) ) + ( 1.0 - 2.0*RATIO )*U(I)
   20 CONTINUE
      T = T + DT
      V(1) = FLFT(T)
      V(NP1) = FRT(T)
      PRINT 202, T,( V(I), I = 1,NP1 )
      IF ( T .GT. TF ) STOP
C
C     NOW DO SECOND SET OF VALUES
C
      THALF = T + DT/2.0
      V(1) = FLFT(THALF)
      V(N+1) = FRT(THALF)
      DO 30 I = 2,N
         U(I) = RATIO * ( V(I+1) + V(I-1) ) + ( 1.0 - 2.0*RATIO )*V(I)
   30 CONTINUE
      T = T + DT
      U(1) = FLFT(T)
      U(NP1) = FRT(T)
      PRINT 202, T,( U(I), I = 1,NP1 )
      IF ( T .GT. TF ) STOP
      GO TO 10
C
  201 FORMAT(/' POTENTIAL VALUES IN ONE DIMENSION BY ',
     +       'EXPLICIT METHOD '/1X,' FOR X = ',F4.1,' TO X = ',
     +       F4.1,' WITH DELTA X OF ',F6.3//1X,' AT T = ',F6.3,
     +       /(1X,6F9.3) )
  202 FORMAT(/1X,' VALUES AT T = ',F8.3/(1X,6F9.3) )
      END
C
C     FUNCTIONS DEFINED FOR THE LEFT AND
C     RIGHT HAND BOUNDARIES.
C
      REAL FUNCTION FLFT(X)
      REAL X
      FLFT = 0.0
      RETURN
      END
C
      REAL FUNCTION FRT(X)
      REAL X
      FRT = 100.0
      RETURN
      END


OUTPUT FOR PROGRAM 1

POTENTIAL VALUES IN ONE DIMENSION BY EXPLICIT METHOD
FOR X = 0.0 TO X = 10.0 WITH DELTA X OF 1.000

AT T = 0.000
   20.000   20.000   20.000   20.000   20.000   20.000
   20.000   20.000   20.000   20.000   20.000

VALUES AT T = .288
    0.000   15.000   20.000   20.000   20.000   20.000
   20.000   20.000   20.000   40.000  100.000

VALUES AT T = .576
    0.000   12.500   18.750   20.000   20.000   20.000
   20.000   20.000   25.000   50.000  100.000

VALUES AT T = .863
    0.000   10.938   17.500   19.688   20.000   20.000
   20.000   21.250   30.000   56.250  100.000

VALUES AT T = 1.151
    0.000    9.844   16.406   19.219   19.922   20.000
   20.313   23.125   34.375   60.625  100.000

VALUES AT T = 1.439
    0.000    9.023   15.469   18.691   19.766   20.059
   20.938   25.234   38.125   63.906  100.000

***** OUTPUT CONTINUED *****

VALUES AT T = 18.997
    0.000    7.742   15.699   24.065   33.002   42.616
   52.954   63.988   75.621   87.694  100.000

VALUES AT T = 19.285
    0.000    7.796   15.801   24.208   33.172   42.797
   53.128   64.138   75.731   87.752  100.000

VALUES AT T = 19.572
    0.000    7.848   15.902   24.347   33.337   42.973
   53.298   64.284   75.838   87.809  100.000

VALUES AT T = 19.860
    0.000    7.900   16.000   24.483   33.499   43.145
   53.463   64.426   75.942   87.864  100.000

VALUES AT T = 20.148
    0.000    7.950   16.096   24.616   33.656   43.313
   ------   64.564   76.043   87.918  100.000


Figure 8.9 Program 2.

      PROGRAM CNIMPL(INPUT, OUTPUT)
C
C     THIS PROGRAM COMPUTES UNSTEADY STATE POTENTIALS IN ONE
C     DIMENSION USING CRANK-NICOLSON IMPLICIT METHOD. THE POTENTIAL
C     AT EACH END IS DETERMINED BY A RELATION OF THE FORM :
C
C                  AU + BU' = C
C
C     PARAMETERS ARE :
C
C     U      - VALUES OF POTENTIAL AT NODES
C     T      - TIME
C     TF     - FINAL TIME VALUE FOR WHICH SOLUTION IS DESIRED
C     DT     - DELTA T
C     LEN    - LENGTH
C     DX     - DELTA X
C     N      - NUMBER OF X INTERVALS
C     DIF    - DIFFUSIVITY, K/C*RHO FOR HEAT FLOW
C     RATIO  - RATIO OF DT(DIF)/(DX)**2
C     COEF   - COEFFICIENT MATRIX FOR IMPLICIT RELATIONS
C
      REAL U(500),COEF(500,3),RHS(500),LEN,T,TF,DIF,RATIO
     +     ,AL,BL,CL,AR,BR,CR,DX,DT
      INTEGER N,NP1,I
C
C     DEFINE SOME CONSTANTS
C
      DATA T,TF,LEN,N,DIF,RATIO/0.0,1000.0,20.0,20,0.119,1.0/
      DATA AL,BL,CL/-0.2,1.0,-3.0/
      DATA AR,BR,CR/1.0,0.0,100.0/
C
C     READ IN INITIAL VALUES
C
      NP1 = N + 1
      READ *, ( U(I), I = 1,NP1 )
      DX = LEN / N
      DT = RATIO*DX*DX / DIF
      PRINT 201, T,LEN,DX,T,( U(I), I = 1,NP1 )
C
C     ESTABLISH COEFFICIENT MATRIX
C
      IF ( BL .NE. 0.0 ) THEN
         COEF(1,2) = 2.0/RATIO + 2.0 - 2.0*AL*DX/BL
         COEF(1,3) = -2.0
      ELSE
         COEF(1,2) = 1.0
         COEF(1,3) = 0.0
      END IF
   20 DO 25 I = 2,N
         COEF(I,1) = -1.0
         COEF(I,2) = 2.0/RATIO + 2.0
         COEF(I,3) = -1.0
   25 CONTINUE
      IF ( BR .NE. 0.0 ) THEN
         COEF(N+1,1) = -2.0
         COEF(N+1,2) = 2.0/RATIO + 2.0 + 2.0*AR*DX/BR
      ELSE
         COEF(N+1,1) = 0.0
         COEF(N+1,2) = 1.0
      END IF
C
C     GET THE LU DECOMPOSITION TO PREPARE FOR SOLVING EQUATIONS
C
   40 DO 50 I = 2,NP1
         COEF(I-1,3) = COEF(I-1,3) / COEF(I-1,2)
         COEF(I,2) = COEF(I,2) - COEF(I,1)*COEF(I-1,3)
   50 CONTINUE
C
C     ESTABLISH THE R.H.S. VECTOR - FIRST THE TOP AND BOTTOM ROWS
C
   55 IF ( BL .NE. 0.0 ) THEN
         RHS(1) = ( 2.0/RATIO - 2.0 + 2.0*AL*DX/BL ) * U(1) +
     +            2.0*U(2) - 4.0*CL*DX/BL
      ELSE
         RHS(1) = CL / AL
      END IF
   60 IF ( BR .NE. 0.0 ) THEN
         RHS(N+1) = 2.0*U(N) + ( 2.0/RATIO - 2.0 - 2.0*AR*DX/BR ) *
     +              U(N+1) + 4.0*CR*DX/BR
      ELSE
         RHS(N+1) = CR / AR
      END IF
C
C     NOW FOR THE OTHER ROWS OF THE RHS VECTOR
C
      DO 100 I = 2,N
         RHS(I) = U(I-1) + ( 2.0/RATIO - 2.0 ) * U(I) + U(I+1)
  100 CONTINUE
C
C     WE GET THE SOLUTION FOR THE CURRENT TIME STEP
C
      U(1) = RHS(1) / COEF(1,2)
      DO 110 I = 2,NP1
         U(I) = ( RHS(I) - COEF(I,1)*U(I-1) ) / COEF(I,2)
  110 CONTINUE
      DO 120 I = 1,N
         JROW = N - I + 1
         U(JROW) = U(JROW) - COEF(JROW,3)*U(JROW+1)
  120 CONTINUE
C
C     WRITE OUT THE SOLUTION JUST FOUND
C
      T = T + DT
      PRINT 202, T, ( U(I), I = 1,NP1 )
      IF ( T .LT. TF ) GO TO 55
      STOP
C
  201 FORMAT(' POTENTIAL VALUES IN ONE DIMENSION BY THE ',
     +       'IMPLICIT METHOD '/1X,' FOR X = ',F4.1,' TO X = ',
     +       F4.1,' WITH DELTA X OF ',F6.3//1X,' AT T = ',F7.3,
     +       /(1X,10F8.2))
  202 FORMAT(1X,' VALUES AT T = ',F8.3,/(1X,10F8.2))
      END


OUTPUT FOR PROGRAM 2

POTENTIAL VALUES IN ONE DIMENSION BY THE IMPLICIT METHOD
FOR X = 0.0 TO X = 20.0 WITH DELTA X OF 1.000

    0.00   10.00   20.00   30.00   40.00   50.00   60.00   70.00   80.00   ...
  100.00   90.00   80.00   70.00   60.00   50.00   40.00   30.00   20.00   ...

VALUES AT T = 8.403
   13.46   13.61   20.97   30.26   40.07   50.00   59.95   69.78   79.--   ...
   88.45   86.91   79.17   69.79   59.98   50.12   40.51   31.92   27.--   ...
  100.00
VALUES AT T = 16.807
   16.09   18.47   23.38   31.20   40.37   50.02   59.69   68.95   76.--   ...
   84.61   82.31   77.01   69.02   59.94   50.82   42.86   38.58   43.--   ...
  100.00
VALUES AT T = 25.210
   19.28   21.15   25.87   32.65   40.98   50.04   59.15   67.58   74.--   ...
   80.77   79.25   74.63   67.91   60.09   52.59   47.46   47.86   57.--   ...
  100.00

**** OUTPUT CONTINUED ****

VALUES AT T = 983.193
   31.70   35.04   38.39   41.74   45.12   48.48   ...
   65.46   68.88   72.32   --.--   --.--   82.66   ...
  100.00
VALUES AT T = 991.597
   31.70   35.05   38.40   41.75   45.11   48.48   ...
   65.47   68.89   72.32   75.77   79.22   82.67   ...
  100.00
VALUES AT T = 1000.000
   31.71   35.05   38.40   41.76   45.12   48.49   ...
   65.48   68.90   --.--   --.--   79.22   82.67   ...
  100.00


Figure 8.10 Program 3.

      PROGRAM ADIUN (INPUT, OUTPUT)
C
C     THIS PROGRAM SOLVES THE UNSTEADY STATE HEAT FLOW EQUATION IN TWO
C     SPACE DIMENSIONS USING THE A.D.I. METHOD.
C
C     PARAMETERS ARE :
C
C     U      - VECTOR OF TEMPERATURES AFTER ODD TRAVERSES
C     V      - TEMPERATURES AFTER EVEN TRAVERSES
C     UCOEF  - MATRIX OF COEFFICIENTS FOR THE U'S
C     VCOEF  - MATRIX FOR THE V'S
C     BCNDU  - VECTOR OF BOUNDARY VALUES FOR THE U EQUATIONS
C     BCNDV  - SAME FOR THE V'S
C     TOP    - VECTOR TO HOLD BOUNDARY VALUES ACROSS TOP OF REGION
C     BOT    - SAME FOR THE VALUES ACROSS THE BOTTOM
C     RT     - HOLD THE RIGHT HAND SIDE VALUES
C     LFT    - HOLD VALUES FOR THE LEFT HAND EDGE
C     M      - NUMBER OF ROWS OF NODES IN THE GRID
C     N      - NUMBER OF COLUMNS OF NODES
C     DIFF   - THERMAL DIFFUSIVITY = K/C*DENSITY
C     H      - THE VALUE OF DELTA X = DELTA Y
C     TIME   - THE TIME VARIABLE
C     TMAX   - MAXIMUM VALUE OF TIME FOR WHICH COMPUTATIONS ARE
C              DESIRED
C     DT     - TIME STEP SIZE, RELATED TO DELTA X THROUGH R
C
C     ESTABLISH THE MATRICES AND VECTORS, AND PUT IN SOME VALUES WITH
C     DATA STATEMENTS.
C
      REAL U(500),V(500),UCOEF(500,3),VCOEF(500,3)
      REAL BCNDU(500),BCNDV(500)
      REAL TOP(100),BOT(100),LFT(100),RT(100)
      DATA UCOEF,VCOEF / 3000 * -1.0 /
      DATA BCNDU,BCNDV / 1000 * 0.0 /
      DATA M,N,R / 7,15,0.5 /
      DATA H,DIFF / 0.125,0.152 /
      DATA TMAX / 20.0 /
      DATA RT,LFT,TOP,BOT / 100 * 100.0,
     +                      300 * 0.0 /
C
C     FOR THE TEST CASE, WE BEGIN WITH TEMPERATURES EVERYWHERE EQUAL TO
C     ZERO. ESTABLISH THESE BY A DATA STATEMENT.
C
      DATA V / 500 * 0.0 /
C
C     SET UP THE COEFFICIENT MATRICES BY OVER-WRITING ON THE DIAGONAL
C     AND CERTAIN OFF DIAGONAL TERMS.
C
      MSIZE = M * N
      DO 10 I = 1,MSIZE
         UCOEF(I,2) = 1.0/R + 2.0
         VCOEF(I,2) = 1.0/R + 2.0
   10 CONTINUE
      ML1 = M - 1
      NL1 = N - 1
      MSL1 = MSIZE - 1
      DO 20 I = N,MSL1,N
         UCOEF(I,3) = 0.0
         UCOEF(I+1,1) = 0.0
   20 CONTINUE
      DO 30 I = M,MSL1,M
         VCOEF(I,3) = 0.0
         VCOEF(I+1,1) = 0.0
   30 CONTINUE
C
C     NOW GET VALUES INTO THE BCND VECTORS
C
      DO 40 I = 1,N
         BCNDU(I) = TOP(I)
         J = MSIZE - N + I
         BCNDU(J) = BOT(I)
   40 CONTINUE
      DO 45 I = 1,M
         J = (I-1)*N + 1
         BCNDU(J) = BCNDU(J) + LFT(I)
         J = I*N
         BCNDU(J) = BCNDU(J) + RT(I)
   45 CONTINUE
      DO 50 I = 1,M
         BCNDV(I) = LFT(I)
         J = MSIZE - M + I
         BCNDV(J) = RT(I)
   50 CONTINUE
      DO 55 I = 1,N
         J = (I-1)*M + 1
         BCNDV(J) = BCNDV(J) + TOP(I)
         J = I*M
         BCNDV(J) = BCNDV(J) + BOT(I)
   55 CONTINUE
C
C     WE NOW GET THE LU DECOMPOSITIONS OF UCOEF AND VCOEF
C
      DO 60 I = 2,MSIZE
         UCOEF(I-1,3) = UCOEF(I-1,3)/UCOEF(I-1,2)
         UCOEF(I,2) = UCOEF(I,2) - UCOEF(I,1)*UCOEF(I-1,3)
         VCOEF(I-1,3) = VCOEF(I-1,3)/VCOEF(I-1,2)
         VCOEF(I,2) = VCOEF(I,2) - VCOEF(I,1)*VCOEF(I-1,3)
   60 CONTINUE
C
C     NOW WE DO THE ITERATIONS UNTIL TIME EQUALS TMAX.
C
      TIME = 0.0
      DT = R / DIFF*H*H
   62 IF ( TIME .GT. TMAX ) STOP
C
C     COMPUTE THE R.H.S. FOR THE U EQUATIONS AND STORE IN THE U VECTOR.
C     FIRST DO THE TOP AND BOTTOM ROWS.
C
      DO 65 I = 1,N
         J = (I-1)*M + 1
         U(I) = ( 1.0/R - 2.0 )*V(J) + V(J+1) + BCNDU(I)
         K = MSIZE - N + I
         J = I*M
         U(K) = V(J-1) + ( 1.0/R - 2.0 )*V(J) + BCNDU(K)
   65 CONTINUE
C
C     NOW FOR THE OTHER ONES
C
      DO 75 I = 2,ML1
         DO 70 J = 1,N
            K = (I-1)*N + J
            L = I + (J-1)*M
            U(K) = V(L-1) + ( 1.0/R - 2.0 )*V(L) + V(L+1) + BCNDU(K)
   70    CONTINUE
   75 CONTINUE
C
C     NOW GET THE SOLUTION FOR THE ODD TRAVERSE
C
      U(1) = U(1) / UCOEF(1,2)
      DO 80 I = 2,MSIZE
         U(I) = ( U(I) - UCOEF(I,1)*U(I-1) ) / UCOEF(I,2)
   80 CONTINUE
      DO 90 I = 1,MSL1
         JROW = MSIZE - I
         U(JROW) = U(JROW) - UCOEF(JROW,3)*U(JROW+1)
   90 CONTINUE
C
C     COMPUTE THE R.H.S. FOR THE EVEN TRAVERSE, STORE IN V. DO THE
C     TOP AND BOTTOM ONES.
C
      DO 95 I = 1,M
         J = (I-1)*N + 1
         V(I) = ( 1.0/R - 2.0 )*U(J) + U(J+1) + BCNDV(I)
         K = MSIZE - M + I
         J = I*N
         V(K) = U(J-1) + ( 1.0/R - 2.0 )*U(J) + BCNDV(K)
   95 CONTINUE
C
C     NOW THE REST OF THE ROWS
C
      DO 105 I = 2,NL1
         DO 100 J = 1,M
            K = (I-1)*M + J
            L = I + (J-1)*N
            V(K) = U(L-1) + ( 1.0/R - 2.0 )*U(L) + U(L+1) + BCNDV(K)
  100    CONTINUE
  105 CONTINUE
C
C     GET THE SOLUTION FOR THE EVEN TRAVERSE
C
      V(1) = V(1) / VCOEF(1,2)
      DO 110 I = 2,MSIZE
         V(I) = ( V(I) - VCOEF(I,1)*V(I-1) ) / VCOEF(I,2)
  110 CONTINUE
      DO 120 I = 1,MSL1
         JROW = MSIZE - I
         V(JROW) = V(JROW) - VCOEF(JROW,3)*V(JROW+1)
  120 CONTINUE
      TIME = TIME + 2.0*DT
C
C     PRINT OUT THE LAST RESULT.
C
      PRINT 202, TIME, ( V(I), I = 1,MSIZE )
  202 FORMAT (//1X,' WHEN T = ',F6.3,' V VALUES ARE' /(1X,15F8.4))
      GO TO 62
      END


EXERCISES 


Section 8.1 


1. The parameters of the basic equation for unsteady-state heat transfer are dimensional. If it is
desired to measure u in °F and x in inches, how must the units of k, c, and ρ be chosen in

$$\frac{\partial^2 u}{\partial x^2} = \frac{c\rho}{k}\,\frac{\partial u}{\partial t}?$$

> 2. Solve for the temperatures at t = 2.062 sec in the 2-cm-thick steel slab of Section 8.1 if the
initial temperatures are given by the relation

$$u(x, 0) = 100 \sin\frac{\pi x}{2}.$$

Use the explicit method with Δx = 0.25 cm. Compare to the analytical solution:
$100e^{-0.3738t} \sin(\pi x/2)$.
3. Solve for the temperatures in a copper rod 10 in long, with the outer curved surface insulated
so that heat flows in only one direction. The initial temperature is linear from 0°C at one end
to 100°C at the other, when suddenly the hot end is brought to 0°C, and the cold end is held
at 0°C. Use Δx = 1 in and an appropriate value of Δt so that k Δt/cρ(Δx)² = ½. Look up
values of k, c, and ρ in a handbook. Carry out the solution for 10 time steps.

4. Repeat Exercise 3 with Δx = 0.5 in, and compare the temperatures at points 1 in, 3 in, and
6 in from the cold end in the two calculations.

5. Repeat Exercise 3 with Δx = 1 in and Δt such that k Δt/cρ(Δx)² = ¼. Compare results with
Exercises 3 and 4.

> 6. Repeat computations with Δx as given below for the diffusion example of Section 8.1,
and compare to the analytical solution at x = 4 and x = 12 cm. Carry them out until
t = 268.8 sec.
a) With Ax = 4 cm, D Ar/(Ax)? 125; 
b) With Ax = 2 cm, D Ar/(Ax)? 
c) With Ax = 1 cm, D Ar/(Ax)? 3 
If you use the results shown in Section 8.1 of the text, you now have data that illustrate the
effect of size of Δx and of r on accuracy. Considering the amount of calculation required,
which is the more effective way to improve accuracy?

Section 8.2 

7. Solve Exercise 2 by the Crank—Nicolson method, r = 1. 

8. Solve Exercise 3 by the Crank—Nicolson method, r = 1. Compare results with Exercises 3, 
4, and 5. 

> 9. The methods of Sections 8.1 and 8.2 can be applied readily to more complicated situations.
For example, if heat is being generated at various points along a bar at a rate that is a function
of x, the unsteady-state heat equation becomes

$$c\rho\,\frac{\partial u}{\partial t} = k\,\frac{\partial^2 u}{\partial x^2} + f(x).$$

Solve this equation where f(x) = x cal/cm³·sec, subject to conditions
u(0, t) = 0, u(1, t) = 0, u(x, 0) = 0.
Take Δx = 0.2, k = 0.37, cρ = 0.433 in cgs units. (These are properties of magnesium.)
Use the Crank-Nicolson method, r = 1, and solve for five time steps.
Section 8.3 
10. Use the set of equations in Eq. (8.10) to find the solution to the example in Section 8.3 
through eight time steps. 
11. Solve the set of equations in Eq. (8.11) to find the solution to the example in Section 8.3 by
the Crank-Nicolson method. Compare results with those from Exercise 10.
>12. Solve the example of Section 8.3 but with two opposite faces of the cube losing heat at a 
rate equal to 0.15A(u — 70), where u is the surface temperature in °F, and A is the area. 
Use Ax = 1 in, and employ the explicit method with k Ar/cp(Ax)? = }. 
13. Solve Exercise 12 using the Crank—Nicolson method with r = 1. Compare results by the 
two methods of solution. 
>14. Heat is added to one end of a 2-ft-long bar of copper, at a rate given by

$$-hA(u - u_s),$$

and lost at the other end through a similar rate equation. Temperatures are $u_s$ = 500°F
at the hot end, and $u_s$ = 60°F at the cold end, with resistance to heat flow being such that
h = 0.3 Btu/sec·ft²·°F at both ends. The circumference of the bar is carefully insulated
so that heat flows in only one dimension. Find the time required for the midpoint of the bar
to reach 200°F. The bar is initially at 60°F at all points.


Section 8.4 


15. Demonstrate, by performing calculations similar to those given in Table 8.5, that the explicit
method is unstable with k Δt/cρ(Δx)² = 0.6.

16. Demonstrate, by performing calculations similar to those given in Table 8.5, that the explicit
method with k Δt/cρ(Δx)² = + has errors that damp out less rapidly than those in Table 8.5,
but that the method is still stable.

17. Demonstrate, by performing calculations similar to those in Table 8.5, except that at both
x = x₁ and x = x₅ the gradient is zero (∂u/∂x = 0 instead of u = constant), that the explicit
method is still stable with k Δt/cρ(Δx)² = ¼. Note, however, how much more slowly an
error damps out, and that the error at a later step becomes a linear combination of earlier
errors.

18. Demonstrate, by performing calculations similar to those in Table 8.6, that the Crank-Nicolson
method is still stable even though the value of k Δt/cρ(Δx)² is taken as 10. You will probably
wish to use the LU method to solve the system.

>19. Compute the largest eigenvalue of the coefficient matrix of Eq. (8.16) for r = 0.5, then for
r = 0.6, when N = 10. (You may wish to use the power-method program of Chapter 6.) Do
you find that the statements in the text about the value of the eigenvalues are confirmed?

20. Repeat Exercise 19 but for Eq. (8.18) using matrix A⁻¹B. Use r = 1.0 and r = 2.0.


Section 8.5 


21. A rectangular plate 2 × 3 in. is initially at 50°. At t = 0, one 2-in. edge is suddenly raised
to 100°, and one 3-in. edge is suddenly cooled to 0°. The temperature on these two edges is
then held constant at these temperatures. The other two faces are perfectly insulated. Use a
1-in. grid to subdivide the plate and write the A.D.I. equations for each of the six points
where unknown temperatures are involved. Use r = 2, and solve the equations for four time
intervals.

>22. Suppose the cube used for the example problem in Section 8.3 (4 × 4 × 4 in.) had heat
flowing in all three directions, such as by having three adjacent faces lose heat by conduction
to the flowing liquid. Set up the equations to solve for the temperature at any point. Do this
in terms of the surrounding temperatures one Δt previously, using the explicit method with
Δx = Δy = Δz = 1. How many time steps are needed to reach t = 15.12 sec? How many
equations are involved in each stage of the calculation?

23. Repeat Exercise 22 for the Crank-Nicolson method, r = 1.

>24. Repeat Exercise 22 using the A.D.I. method.


Section 8.6 


25. In the finite-element method, one usually spaces the nodes more closely where u is expected
to vary more rapidly. Suppose, in two adjoining elements, that the widths are h₁ and h₂.
Rederive the equivalent of Eq. (8.24) for this case. Can you get the same result by applying
finite-difference approximations to ∂²u/∂x² over three points, x₁, x₂, x₃, that are unevenly
spaced?

26. Show that Eq. (8.25) does give the explicit formula of Section 8.5 for equispaced nodes that
are connected to give four adjacent triangular elements that share the common central node.

27. Solve Exercise 21 by the finite-element method. While we have not given the element integrals
for a derivative boundary, in this case you can avoid having derivatives on the boundary by
reflecting across the boundaries where these occur.


APPLIED PROBLEMS AND PROJECTS 


28. When steel is forged, billets are heated in a furnace until the metal is of the proper temperature,
between 2000°F and 2300°F. It can then be formed by the forging press into rough shapes
that are later given their final finishing operations. In order to produce a certain machine part,
a billet of size 4 × 4 × 20 in is heated in a furnace whose temperature is maintained at
2350°F. You have been requested to estimate how long it will take all parts of the billet to
reach a temperature above 2000°F. Heat transfers to the surface of the billet at a very high
rate, principally through radiation. It has been suggested that you can solve the problem by
assuming that the surface temperature becomes 2250°F instantaneously and remains at that
temperature. Using this assumption, find the required heating time.

Since the steel piece is relatively long compared to its width and thickness, it may not
introduce significant error to calculate as if it were infinitely long. This will simplify the
problem, permitting a two-dimensional treatment rather than three-dimensional. Such a
calculation should also give a more conservative estimate of heating time. Compare the estimates
from two- and three-dimensional approaches.


29. After you have calculated the answers to Problem 28, your results have been challenged on
the basis of assuming constant surface temperature of the steel. Radiation of heat flows
according to the equation

$$q = E\sigma\left(u_F^4 - u_S^4\right)\ \text{Btu/hr·ft}^2,$$

where E = emissivity (use 0.80), σ is the Stefan-Boltzmann constant (0.171 ×
10⁻⁸ Btu/hr·ft²·°F⁴), and $u_F$ and $u_S$ are the furnace and surface absolute temperatures,
respectively (°F + 460°).

The heat radiating to the surface must also flow into the interior of the billet by conduction,
so

$$q = -k\,\frac{\partial u}{\partial x},$$

where k is the thermal conductivity of steel (use 26.2 Btu/hr·ft²·°F/ft) and (∂u/∂x) is the
temperature gradient at the surface in a direction normal to the surface. Solve the problem
with this boundary condition and compare your solution to that of Problem 28. (Observe that
this is now a nonlinear problem. Think carefully how your solution can cope with it.)


30. Shipment of liquefied natural gas by refrigerated tankers to industrial nations is becoming an
important means of supplying the world's energy needs. It must be stored at the receiving
port, however. (A. R. Duffy and his coworkers (1967) discuss the storage of liquefied natural
gas in underground tanks.) A commercial design, based on experimental verification of its
feasibility, contemplated a prestressed concrete tank 270 ft in diameter and 61 ft deep, holding
some 600,000 bbl of liquefied gas at −258°F. Convection currents in the liquid were shown
to keep the temperature uniform at this value, the boiling point of the liquid.

Important considerations of the design are the rate of heat gained from the surroundings
(causing evaporation of the liquid gas) and variation of temperatures in the earth below the
tank (relating to the safety of the tank, which could be affected by possible settling or
frost-heaving).

The tank itself is to be made of concrete 6 in thick, covered with 8 in of insulation (on
the liquid side). (A sealing barrier keeps the insulation free of liquid; otherwise its insulating
capacity would be impaired.) The experimental tests showed that there is a very small
temperature drop through the concrete: 12°F. This observed 12°F temperature difference seems
reasonable in light of the relatively high thermal conductivity of concrete. We expect then
that most of the temperature drop occurs in the insulation or in the earth below the tank.

Since the commercial-design tank is very large, if we are interested in ground temperatures
near the center of the tank (where penetration of cold will be a maximum) it should
be satisfactory to consider heat flowing in only one dimension, in a direction directly downward
from the base of the tank. Making this simplifying assumption, compute how long it will
take for the temperature to decrease to 32°F (freezing point of water) at a point 8 ft away
from the tank wall. The necessary thermal data are

                                        Insulation   Concrete   Earth
Thermal conductivity (Btu/hr·ft·°F)       0.013        0.90      ...
Density (lb/ft³)                          2.0          150       ...
Specific heat (Btu/lb·°F)                 0.195        0.200     ...

Assume that the initial conditions are: temperature of liquid: −258°F; temperature of
insulation: −258°F to 72°F (inner surface to outer); temperature of concrete: 72°F to 60°F;
temperature of earth: 60°F.

31. Modify Program 1 so that end conditions of the form

$$a u + b\,\frac{\partial u}{\partial x} = c$$

are permitted, similar to Program 2. Then use your program to solve Exercise 14. Vary the
size of Δx and/or r = k Δt/cρ(Δx)² until you are sure that the time for the midpoint temperature
to become 200°F is known to within 0.5% relative error.

32. Solve Exercise 14 by Program 2. Critically compare the efficiency of Program 2 in obtaining
the solution with the efficiency of the program you wrote in Problem 31.

33. Write a program similar to Program 3 but employing the A.D.I. method in three space
dimensions with a parallelepiped-shaped region. Provide for derivative boundary conditions
as an alternative to fixed boundary conditions by allowing for conditions of the form

$$a u + b\,\frac{\partial u}{\partial x} = c.$$




Hyperbolic 
Partial-Differential 
Equations 


9.0 CONTENTS OF THIS CHAPTER 


The third classification of partial-differential equations, hyperbolic differential 
equations, includes the “wave equation” that is fundamental to the study of 
vibrating systems. They are also involved in transport problems (diffusion of 
matter, neutron diffusion, and radiation transfer), wave mechanics, gas dynamics, 
supersonic flow, and other important areas. The technology in most of these 
applications is so complex that their study is beyond our scope. We will settle 
for simple situations within the present experience of our readers. One such case 
is the vibrations of a string held taut by fixed ends, as in a violin string. We 
outline the derivation of the simple wave equation in one dimension. Imagine 
an elastic string stretched between two fixed endpoints and set to vibrating by 
plucking it with a finger. 

We make a number of simplifying assumptions: the string is perfectly elastic,
and we neglect gravitational forces, so that the only force is the tension force
in the direction of the string; the string is uniform in density and thickness and
has a weight per unit length of w lb/ft; the lateral displacement of the string is
so small that the tension can be considered to be a constant value of T lb; and
the slope of the string is hence small enough that sin α = tan α, where α is
the angle of inclination. Figure 9.1 illustrates the problem with the lateral
displacement greatly exaggerated.
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Figure 9.2 


We take x = 0 at the left end. L is the total length of the string. Focus
attention on the element of length dx between points A and B. The uniform
tension T acts at each end. We are interested in how y, the lateral displacement,
varies with time t and with distance x along the string.

The forces acting in the y-direction are the vertical components of the two
tensions (see Fig. 9.2). We take the upward direction as positive:

Upward force at left end: $-T \sin\alpha_1 = -T \tan\alpha_1 = -T\left(\dfrac{\partial y}{\partial x}\right)_A$

Upward force at right end: $+T \sin\alpha_2 = +T \tan\alpha_2 = T\left(\dfrac{\partial y}{\partial x}\right)_B$

Net force: $T\left[\left(\dfrac{\partial y}{\partial x}\right)_B - \left(\dfrac{\partial y}{\partial x}\right)_A\right] = T\,\dfrac{\partial}{\partial x}\left(\dfrac{\partial y}{\partial x}\right)dx = T\,\dfrac{\partial^2 y}{\partial x^2}\,dx$

Partials are used to express the slope because y is a function of both t (time)
and x (horizontal distance). We use Newton's law, and equate the force to mass
times acceleration in the vertical direction. Our simplifying assumptions permit
us to use w dx as the weight:

$$\frac{w\,dx}{g}\,\frac{\partial^2 y}{\partial t^2} = T\,\frac{\partial^2 y}{\partial x^2}\,dx, \qquad \frac{\partial^2 y}{\partial t^2} = \frac{Tg}{w}\,\frac{\partial^2 y}{\partial x^2}. \qquad (9.1)$$

This is the wave equation in one dimension. The conditions imposed on the
solution are the end conditions [ay + b(∂y/∂x) = f(t)] at x = 0 and x = L, and
the initial conditions at t = 0. Initial conditions specifying both y = f(x) and
the velocity ∂y/∂t = g(x) are usual for this problem.

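Comparing Eq. (9.1) to the general form of a second-order partial-differential
equation,

$$A\,\frac{\partial^2 u}{\partial x^2} + B\,\frac{\partial^2 u}{\partial x\,\partial y} + C\,\frac{\partial^2 u}{\partial y^2} + D\left(x, y, u, \frac{\partial u}{\partial x}, \frac{\partial u}{\partial y}\right) = 0,$$

shows B² − 4AC > 0, since B = 0, A = 1, and C = −Tg/w. We see that Eq.
(9.1) falls in the class of hyperbolic equations.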

We will solve some typical hyperbolic partial-differential equations by two 
methods in this chapter. 


9.1 SOLVING THE WAVE EQUATION BY FINITE DIFFERENCES 
Uses finite-difference approximations of the derivatives similar to the technique of 
Chapters 6, 7, and 8, but we discover some unexpected situations. 


9.2 COMPARISON TO THE D’ALEMBERT SOLUTION 
Shows that the finite-difference solution can exactly match the analytical solution! 
9.3 STABILITY OF THE NUMERICAL METHOD 


Demonstrates that the finite-difference solution is stable when we use restricted
values for Δt and Δx.


9.4 METHOD OF CHARACTERISTICS 
Leads you into the intricacies of a method that is tedious but that can handle 
discontinuities. It is therefore of great value in many practical applications. 


9.5 THE WAVE EQUATION IN TWO SPACE DIMENSIONS 
Shows that finite-difference approximations permit the solution when there are two 
or even three spatial variables. 


9.6 CHAPTER SUMMARY 
Reviews the chapter in our accustomed style. 
9.7 COMPUTER PROGRAM 
Illustrates how the computer solves problems similar to those of Section 9.1. 


9.1 SOLVING THE WAVE EQUATION BY FINITE DIFFERENCES 


We attack the problem of solving the one-dimensional wave equation in the usual way
by replacing derivatives by difference quotients. We use superscripts to denote time, and
subscripts for position:

$$\frac{y_i^{j+1} - 2y_i^{j} + y_i^{j-1}}{(\Delta t)^2} = \frac{Tg}{w}\left(\frac{y_{i+1}^{j} - 2y_i^{j} + y_{i-1}^{j}}{(\Delta x)^2}\right).$$

Solving for the displacement at the end of the current interval, at t = t_{j+1}, we get

$$y_i^{j+1} = \frac{Tg(\Delta t)^2}{w(\Delta x)^2}\left(y_{i+1}^{j} + y_{i-1}^{j}\right) - y_i^{j-1} + 2\left(1 - \frac{Tg(\Delta t)^2}{w(\Delta x)^2}\right)y_i^{j}.$$

Note that selecting the ratio Tg(Δt)²/w(Δx)² = 1 gives some simplification, though
$y_i^{j+1}$ is still a function of conditions at both t_j and t_{j−1}:

$$y_i^{j+1} = y_{i+1}^{j} + y_{i-1}^{j} - y_i^{j-1}. \qquad (9.2)$$

Equation (9.2) is the usual way that the one-dimensional wave equation is solved
numerically. We are, of course, interested in the validity of choosing the ratio at unity.
Our finite-difference replacements to the derivatives do not have mixed error terms here,
as they did in the explicit method for the heat equation; but, as we will later show, after
discussing the method of characteristics, if the ratio is greater than one, we cannot be
sure of convergence. Stability also sets a limit of unity to the ratio. It is surprising to
find that, if one uses a value of Tg(Δt)²/w(Δx)² of less than one, the results are less
accurate, while with the ratio equal to one we can get exact analytical answers!*

There is still a problem in applying Eq. (9.2). We know y at $t = t_0 = 0$ from the
initial condition, but to compute y at $t = \Delta t = t_1$, we need values at $t = t_{-1}$. We may
at first be bothered by a need to know displacements before the start of the problem; but
if we imagine the function $y = y(x, t)$ to be extended backward in time, the term $t_{-1}$
makes good sense. Since we will ordinarily get periodic functions for y versus t at a
given point, we can consider zero time as an arbitrary point at which we know the
displacement and velocity, within the duration of an ongoing process.

One commonly used way to get values for the fictitious point at $t = t_{-1}$ is by
employing the specification for the initial velocity $\partial y/\partial t = g(x)$ at $t = 0$. Using a central-
difference approximation, we have

$$\frac{\partial y}{\partial t}(x_i, 0) = \frac{y_i^{1} - y_i^{-1}}{2\,\Delta t} = g(x_i),$$

$$y_i^{-1} = y_i^{1} - 2g(x_i)\,\Delta t \quad \text{at } t = 0 \text{ only.} \qquad (9.3)$$


Equation (9.3) is valid only at $t = 0$; substituting into Eq. (9.2) gives us, for $t = t_1$,

$$y_i^{1} = \tfrac{1}{2}\left(y_{i+1}^{0} + y_{i-1}^{0}\right) + g(x_i)\,\Delta t. \qquad (9.4)$$
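
The substitution that leads to Eq. (9.4) can be written out step by step. This is only a worked restatement in the notation of Eqs. (9.2) and (9.3), with $j = 0$ in Eq. (9.2):

```latex
\begin{align*}
y_i^{1} &= y_{i+1}^{0} + y_{i-1}^{0} - y_i^{-1}
        && \text{Eq. (9.2) with } j = 0,\\
        &= y_{i+1}^{0} + y_{i-1}^{0} - \bigl(y_i^{1} - 2g(x_i)\,\Delta t\bigr)
        && \text{by Eq. (9.3)},\\
2y_i^{1} &= y_{i+1}^{0} + y_{i-1}^{0} + 2g(x_i)\,\Delta t,\\
y_i^{1} &= \tfrac{1}{2}\bigl(y_{i+1}^{0} + y_{i-1}^{0}\bigr) + g(x_i)\,\Delta t .
\end{align*}
```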


*What this means is that, with $Tg(\Delta t)^2/w(\Delta x)^2$ less than one, errors will decrease as $\Delta x \to 0$; that is, the
method is stable. However, with finite values for $\Delta x$, the computed values are not exact unless the ratio is
equal to one. Our discussion of characteristics later in this chapter will clarify this unusual situation.
After computing the first line with Eq. (9.4), we use Eq. (9.2) thereafter.* If the boundary
conditions involve derivatives, we rewrite them in terms of difference quotients to
incorporate them in our difference equation.


EXAMPLE  A banjo string is 80 cm long and weighs 1.0 g. It is stretched with a tension of 40,000
g. At a point 20 cm from one end it is pulled 0.6 cm from the equilibrium position and
then released. Find the displacement of points along the string as a function of time.
How long does it take for one complete cycle of motion? From this, compute the frequency
with which it vibrates.

Let $\Delta x = 10$ cm. In Table 9.1 we show the results of using Eqs. (9.4) and (9.2).
Because the string is just released from its initially displaced position, the initial velocity
at all points is zero, and Eq. (9.4) becomes simply

$$y_i^{1} = \tfrac{1}{2}\left(y_{i+1}^{0} + y_{i-1}^{0}\right).$$


The initial conditions imply that y is linear in x from y = 0 at x = 0 to y = 0.6 at x =
20 and also linear from there to y = 0 at x = 80. The size of time steps is given by

$$\frac{Tg(\Delta t)^2}{w(\Delta x)^2} = 1 = \frac{(40{,}000)(980)(\Delta t)^2}{(1.0/80)(10)^2},$$

$$\Delta t = 0.000179 \text{ sec.}$$


After we have completed calculations for eight time steps, we observe that the y-
values are reproducing the original steps but with negative signs and the ends of the string
reversed; that is, half a cycle has been completed in eight steps. For a complete cycle,
16 $\Delta t$'s are needed. The frequency of vibration is then

$$f = \frac{1}{(16)(0.000179)} \approx 350 \text{ cycles/sec.}$$


(This value is exactly the same as that given by the standard formula,

$$f = \frac{1}{2L}\sqrt{\frac{Tg}{w}}.)$$


The solution to our example as given in Table 9.1 is exactly the analytical solution,
as we will now demonstrate. We will compare our finite-difference equation (Eq. (9.2))
with the d'Alembert solution to the wave equation.
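
The banjo string computation can be reproduced in a few lines. The sketch below is an illustration only, not the book's program: the function names `banjo_string` and `time_step` are hypothetical, and the data (80 cm, 1.0 g, tension 40,000 g, $\Delta x = 10$ cm) are taken from the example. It starts with Eq. (9.4) for zero initial velocity and continues with Eq. (9.2).

```python
import math

def banjo_string(steps=16):
    """Explicit scheme with Tg(dt)^2 / (w(dx)^2) = 1 and zero initial velocity."""
    length, dx = 80.0, 10.0                      # cm
    n = int(length / dx)                         # 8 intervals, 9 nodes
    xs = [i * dx for i in range(n + 1)]
    # Initial shape: linear up to 0.6 at x = 20, then linear down to 0 at x = 80.
    y0 = [0.6 * x / 20.0 if x <= 20.0 else 0.6 * (80.0 - x) / 60.0 for x in xs]
    # First step, Eq. (9.4) with g(x) = 0; the ends stay fixed at zero.
    y1 = [0.0] + [0.5 * (y0[i + 1] + y0[i - 1]) for i in range(1, n)] + [0.0]
    rows = [y0, y1]
    for _ in range(steps - 1):
        prev, cur = rows[-2], rows[-1]
        # Eq. (9.2): new value from the two neighbors minus the value two levels back.
        rows.append([0.0] + [cur[i + 1] + cur[i - 1] - prev[i]
                             for i in range(1, n)] + [0.0])
    return rows

def time_step(tension=40000.0, g=980.0, weight=1.0, length=80.0, dx=10.0):
    """Time step that makes Tg(dt)^2 / (w(dx)^2) equal to one."""
    w = weight / length                          # weight per unit length
    return dx * math.sqrt(w / (tension * g))

rows = banjo_string()
dt = time_step()
frequency = 1.0 / (16 * dt)

for j, row in enumerate(rows):
    print(f"{j:2d} dt", " ".join(f"{v:5.1f}" for v in row))
print(f"dt = {dt:.6f} s, frequency = {frequency:.1f} cycles/s")
```

The row after eight steps is the negative of the initial row read from the other end, and the row after sixteen steps repeats the initial one, which is the half-cycle and full-cycle behavior described in the example.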


9.2 COMPARISON TO THE D'ALEMBERT SOLUTION


The method of d'Alembert lets us find the analytical solution to the wave equation in
one dimension. The one-dimensional wave equation, with $\sqrt{Tg/w} = c$, is

$$\frac{\partial^2 y}{\partial t^2} = c^2\,\frac{\partial^2 y}{\partial x^2}. \qquad (9.5)$$


*We will later discuss a more accurate way to begin the solution. Equation (9.4) is satisfactory when the initial
velocity is zero.


Table 9.1 Solution to banjo string example

                                        x
   t         0     10     20     30     40     50     60     70     80
   0        0.0    0.3    0.6    0.5    0.4    0.3    0.2    0.1    0.0
   Δt       0.0    0.3    0.4    0.5    0.4    0.3    0.2    0.1    0.0
   2Δt      0.0    0.1    0.2    0.3    0.4    0.3    0.2    0.1    0.0
   3Δt      0.0   -0.1    0.0    0.1    0.2    0.3    0.2    0.1    0.0
   4Δt      0.0   -0.1   -0.2   -0.1    0.0    0.1    0.2    0.1    0.0
   5Δt      0.0   -0.1   -0.2   -0.3   -0.2   -0.1    0.0    0.1    0.0
   6Δt      0.0   -0.1   -0.2   -0.3   -0.4   -0.3   -0.2   -0.1    0.0
   7Δt      0.0   -0.1   -0.2   -0.3   -0.4   -0.5   -0.4   -0.3    0.0
   8Δt      0.0   -0.1   -0.2   -0.3   -0.4   -0.5   -0.6   -0.3    0.0
   9Δt      0.0   -0.1   -0.2   -0.3   -0.4   -0.5   -0.4   -0.3    0.0
   10Δt     0.0   -0.1   -0.2   -0.3   -0.4   -0.3   -0.2   -0.1    0.0
   11Δt     0.0   -0.1   -0.2   -0.3   -0.2   -0.1    0.0    0.1    0.0
   12Δt     0.0   -0.1   -0.2   -0.1    0.0    0.1    0.2    0.1    0.0
   13Δt     0.0   -0.1    0.0    0.1    0.2    0.3    0.2    0.1    0.0
   14Δt     0.0    0.1    0.2    0.3    0.4    0.3    0.2    0.1    0.0
   15Δt     0.0    0.3    0.4    0.5    0.4    0.3    0.2    0.1    0.0
   16Δt     0.0    0.3    0.6    0.5    0.4    0.3    0.2    0.1    0.0


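By direct substitution, it is readily seen that, for any arbitrary functions F and G,
Eq. (9.5) is solved by

$$y(x, t) = F(x + ct) + G(x - ct). \qquad (9.6)$$

The demonstration is easy, since

$$\frac{\partial y}{\partial t} = F'\,\frac{\partial (x + ct)}{\partial t} + G'\,\frac{\partial (x - ct)}{\partial t} = cF' - cG',$$

$$\frac{\partial^2 y}{\partial t^2} = c^2F'' - c(-c)G'' = c^2F'' + c^2G'', \qquad (9.7)$$

$$\frac{\partial y}{\partial x} = F'\,\frac{\partial (x + ct)}{\partial x} + G'\,\frac{\partial (x - ct)}{\partial x} = F' + G',$$

$$\frac{\partial^2 y}{\partial x^2} = F'' + G''. \qquad (9.8)$$


Equation (9.5) results immediately from Eqs. (9.7) and (9.8).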
The solution to the problem is found, then, if we can find a pair of functions of the
form of (9.6) whose sum matches the initial conditions and the boundary conditions for
the problem. If the initial conditions are

$$y(x, 0) = f(x),$$

$$\frac{\partial y}{\partial t}(x, 0) = g(x),$$

it is again readily seen that y(x, t) is given by

$$y(x, t) = \tfrac{1}{2}\left[f(x + ct) + f(x - ct)\right] + \frac{1}{2c}\int_{x-ct}^{x+ct} g(v)\,dv. \qquad (9.9)$$


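Note that we can rewrite Eq. (9.9) in the form of Eq. (9.6). In Eq. (9.9) we have changed
to the dummy variable v under the integral sign. The demonstration parallels that above
when we recall how to differentiate an integral:

$$\frac{\partial}{\partial t}\left[\frac{1}{2c}\int_{x-ct}^{x+ct} g(v)\,dv\right] = \frac{1}{2c}\left[(c)\,g(x + ct) - (-c)\,g(x - ct)\right] = \tfrac{1}{2}\left[g(x + ct) + g(x - ct)\right],$$

$$\frac{\partial^2}{\partial t^2}\left[\frac{1}{2c}\int_{x-ct}^{x+ct} g(v)\,dv\right] = \frac{c}{2}\left[g'(x + ct) - g'(x - ct)\right],$$

$$\frac{\partial}{\partial x}\left[\frac{1}{2c}\int_{x-ct}^{x+ct} g(v)\,dv\right] = \frac{1}{2c}\left[g(x + ct) - g(x - ct)\right],$$

$$\frac{\partial^2}{\partial x^2}\left[\frac{1}{2c}\int_{x-ct}^{x+ct} g(v)\,dv\right] = \frac{1}{2c}\left[g'(x + ct) - g'(x - ct)\right],$$

so the second time derivative of the integral term is $c^2$ times its second space derivative, as Eq. (9.5) requires.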


Equation (9.9) gives the value of y at any time t provided that y is known at points
ct to the right and left of the point at t = 0, and in terms of the integral of the initial
velocity between the lateral points. Hence it is useful to find the displacement of points
in the interior portions of a vibrating string.

We are now in a position to verify that our numerical procedure, given by Eq. (9.2),
gives the exact solution (except for round-off), provided that two lines of correct y-values
are known. The algorithm we use for the numerical procedure is

$$y_i^{j+1} = y_{i+1}^{j} + y_{i-1}^{j} - y_i^{j-1}. \qquad (9.10)$$

We now show that the solution to this difference equation is a solution to the differential
equation. Consider the function

$$y_i^{j} = F(x_i + ct_j) + G(x_i - ct_j). \qquad (9.11)$$

Because we use

$$\frac{Tg(\Delta t)^2}{w(\Delta x)^2} = \frac{c^2(\Delta t)^2}{(\Delta x)^2} = 1,$$

then

$$\Delta x = c\,\Delta t.$$

Write $x_i$ and $t_j$ in terms of starting values, $x_0$ and $t_0 = 0$:

$$x_i = x_0 + i\,\Delta x,$$

$$ct_j = c(t_0 + j\,\Delta t) = cj\,\Delta t = j\,\Delta x.$$

Substituting into Eq. (9.11), we get

$$y_i^{j} = F(x_0 + i\,\Delta x + j\,\Delta x) + G(x_0 + i\,\Delta x - j\,\Delta x) = F[x_0 + (i + j)\Delta x] + G[x_0 + (i - j)\Delta x].$$

Use this relation to rewrite the right-hand side of the difference equation, Eq. (9.10):

$$\begin{aligned}
y_{i-1}^{j} + y_{i+1}^{j} - y_i^{j-1}
  &= F[x_0 + (i - 1 + j)\Delta x] + G[x_0 + (i - 1 - j)\Delta x] \\
  &\quad + F[x_0 + (i + 1 + j)\Delta x] + G[x_0 + (i + 1 - j)\Delta x] \\
  &\quad - F[x_0 + (i + j - 1)\Delta x] - G[x_0 + (i - j + 1)\Delta x] \\
  &= F[x_0 + (i + j + 1)\Delta x] + G[x_0 + (i - j - 1)\Delta x] = y_i^{j+1}. \qquad (9.12)
\end{aligned}$$


Equation (9.12) shows that, if the previous two lines of the numerical solution are of the 
form of Eq. (9.11) (and hence must be exact solutions of the wave equation in view of 
Eq. (9.6)), it follows that the values on the next line are exact solutions also, because 
they also are of the form of Eq. (9.11). 
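
A quick numerical check of this argument is sketched below. The functions `F` and `G` are arbitrary choices made here for illustration (they do not come from the text), and the grid is treated as a piece of an unbounded string, so the span of usable nodes shrinks by one node per side at each step. With $c\,\Delta t = \Delta x$, the values produced by Eq. (9.10) should agree with $F(x + ct) + G(x - ct)$ to round-off.

```python
import math

def F(s):
    return math.sin(1.3 * s) + 0.2 * s * s       # arbitrary smooth function

def G(s):
    return math.exp(-0.5 * s * s)                # another arbitrary function

def exact(x, t, c):
    return F(x + c * t) + G(x - c * t)           # form of Eq. (9.6)

c, dx = 2.0, 0.25
dt = dx / c                                      # makes c^2 dt^2 / dx^2 = 1
xs = [i * dx for i in range(-40, 41)]

# Two starting lines taken from the exact solution.
prev = [exact(x, 0.0, c) for x in xs]
cur = [exact(x, dt, c) for x in xs]

max_err = 0.0
lo, hi = 0, len(xs) - 1
for j in range(1, 15):
    lo, hi = lo + 1, hi - 1                      # no boundary data: trim one node per side
    nxt = cur[:]
    for i in range(lo, hi + 1):
        nxt[i] = cur[i + 1] + cur[i - 1] - prev[i]          # Eq. (9.10)
        max_err = max(max_err, abs(nxt[i] - exact(xs[i], (j + 1) * dt, c)))
    prev, cur = cur, nxt

print(f"largest deviation from F(x+ct) + G(x-ct): {max_err:.2e}")
```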

In order that our simple algorithm give the analytical solution, it is then necessary 
only that two lines of the computation be correct. The first line is correct because it is 
given by the initial conditions. While Eq. (9.4) is frequently recommended for giving
the second line in terms of the initial velocity, it is sometimes inaccurate because it
assumes that the initial velocity and the average velocity from $t_{-1}$ to $t_1$ are the same,
and this may not be true.*


*The relationship of Eq. (9.4) is correct whenever g(x) = constant. It is therefore correct for the case of
g(x) = 0.
Equation (9.9) offers a more exact way to get the second line of y-values. If we
employ that relationship at $t = \Delta t$, we have

$$y(x_i, \Delta t) = y_i^{1} = \tfrac{1}{2}\left[f(x_i + \Delta x) + f(x_i - \Delta x)\right] + \frac{1}{2c}\int_{x_i - \Delta x}^{x_i + \Delta x} g(v)\,dv$$

$$= \tfrac{1}{2}\left[y_{i+1}^{0} + y_{i-1}^{0}\right] + \frac{1}{2c}\int_{x_{i-1}}^{x_{i+1}} g(v)\,dv. \qquad (9.13)$$


Equation (9.13) differs from Eq. (9.4) only in the last term. We now see why the banjo
string example of Section 9.1 gave correct values in spite of using Eq. (9.4): In the
example, g(x) was everywhere zero, so the form of the last term was unimportant. In
the next example, we illustrate the difference between the two different ways to begin
the solution.


A string whose length is 9 units is initially in its equilibrium position. It is set into motion
by striking it so that its initial velocity is given by $\partial y/\partial t = 3 \sin(\pi x/L)$. Take $\Delta x = 1$
unit and assume that $c = \sqrt{Tg/w} = 2$. When the ratio $c^2(\Delta t)^2/(\Delta x)^2$ is unity, $\Delta t =
0.5$ time unit.

If the ends are fixed, find the displacements at t = 0.5 time unit later. The length
is subdivided into nine intervals because $\Delta x = 1$.

The displacements we require are after one time step. Table 9.2 summarizes the
computations. We first compute them using Eq. (9.4). The values disagree by several
percent from the analytical values, which are given by

$$y(x, t) = \frac{1}{2c}\int_{x-ct}^{x+ct} 3 \sin\frac{\pi v}{L}\,dv = \frac{3L}{2\pi c}\left[\cos\left(\frac{\pi x}{L} - \frac{\pi ct}{L}\right) - \cos\left(\frac{\pi x}{L} + \frac{\pi ct}{L}\right)\right].$$


When $\Delta x$ is cut in half, Eq. (9.4) gives improved values; it now requires two time steps
to reach t = 0.5, of course.

Using Eq. (9.13), and evaluating the integral with Simpson's rule, we obtain results
with $\Delta x = 1$ that are accurate to within one in the fourth decimal place.
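
The two ways of starting the solution can be compared numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and Simpson's rule is applied with only two panels over $[x_{i-1}, x_{i+1}]$, a choice made here because it is the simplest one that uses the grid points themselves. It evaluates Eq. (9.4), Eq. (9.13), and the analytical formula at t = 0.5 for x = 1, 2, 3, 4.

```python
import math

L, c, dx = 9.0, 2.0, 1.0
dt = dx / c                                   # ratio c^2 dt^2 / dx^2 = 1

def f(x):                                     # initial displacement
    return 0.0

def g(x):                                     # initial velocity
    return 3.0 * math.sin(math.pi * x / L)

def first_step_eq94(x):
    """Eq. (9.4): assumes the initial velocity holds over the whole step."""
    return 0.5 * (f(x + dx) + f(x - dx)) + g(x) * dt

def first_step_eq913(x):
    """Eq. (9.13) with a two-panel Simpson's rule for the velocity integral."""
    integral = dx / 3.0 * (g(x - dx) + 4.0 * g(x) + g(x + dx))
    return 0.5 * (f(x + dx) + f(x - dx)) + integral / (2.0 * c)

def analytical(x, t):
    k = math.pi / L
    return 3.0 * L / (2.0 * math.pi * c) * (
        math.cos(k * (x - c * t)) - math.cos(k * (x + c * t)))

print("  x   Eq.(9.4)  Eq.(9.13)  analytical")
for x in (1.0, 2.0, 3.0, 4.0):
    print(f"{x:3.0f}  {first_step_eq94(x):8.4f}  {first_step_eq913(x):9.4f}"
          f"  {analytical(x, dt):10.5f}")
```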


9.3 STABILITY OF THE NUMERICAL METHOD


Since we will ordinarily solve the one-dimensional wave equation numerically only with

$$\frac{Tg(\Delta t)^2}{w(\Delta x)^2} = 1,$$

it is sufficient to demonstrate stability for that scheme. We assume that a set of errorless
computations have been made when a single error of size 1 occurs; we trace the effects

Table 9.2 Comparison of ways to begin the solution of the wave equation in one dimension:

$$\frac{\partial^2 y}{\partial t^2} = c^2\,\frac{\partial^2 y}{\partial x^2}, \qquad y(x, 0) = 0, \qquad \frac{\partial y}{\partial t}(x, 0) = 3 \sin\left(\frac{\pi x}{9}\right),$$

over x = 0 to x = 9, with c = 2.

Value of displacements at t = 0.5

Using Eq. (9.4): Δx = 1
   x       1          2          3          4
   y     0.5130     0.9642     1.2990     1.4772

Using Eq. (9.4): Δx = 0.5
   x       1          2          3          4
   y     0.5052     0.9495     1.2793     1.4548

Using Eq. (9.13): Δx = 1, Simpson's-rule integration
   x       1          2          3          4
   y     0.5027     0.9448     1.2729     1.4475

Analytical solution
   x       1          2          3          4
   y     0.50267    0.94472    1.27282    1.44740


Table 9.3 Propagation of single error in numerical solution to wave equation
Initially error-free values 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 
Error made here 0 0 1 0 0 0 0 
0 1— 0 1 0 0 0 
0— 0 0 0 ~~ 0 0 
oI 0 0 oI 0 
0 oo 0 0 0 0 
0 0 o 4 0 -1— 0 
0 0 0 0 S24 ~—«O0 0 
0 0 0 a 0 
0 0 = 0 0 0 ~>O0 
0 -1 0 0 0 1 0 


of the single error. If the error does not have increasingly great effect on subsequent
calculations, we call the method stable.

This simple procedure is adequate because, since the problem is linear, the principle
of superposition lets us add the effects of all errors together and lets us add these errors
to the true solution to obtain the actual results. Table 9.3 demonstrates the principle,
assuming that the displacement of the endpoints is specified so that these are always free
of error.

As the arrows indicate, the wave equation propagates disturbances in opposite
directions, with reflections occurring at fixed ends, with a reversal of sign on reflection.
Stability is demonstrated because the original error does not grow in size.
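
The same kind of experiment is easy to run by machine. The sketch below is an illustration, not a reproduction of Table 9.3: the function name `propagate_unit_error` and the choices of seven nodes, the third node for the error, and 40 steps are assumptions made here. It applies Eq. (9.2) to the error alone, with error-free endpoints, and records the largest magnitude that ever appears.

```python
def propagate_unit_error(n_nodes=7, error_node=2, steps=40):
    """Apply Eq. (9.2) to an error field that starts as a single unit error."""
    prev = [0.0] * n_nodes                    # error-free line before the mistake
    cur = [0.0] * n_nodes
    cur[error_node] = 1.0                     # the single error of size 1
    history = [prev, cur]
    for _ in range(steps):
        nxt = [0.0] * n_nodes                 # endpoints stay free of error
        for i in range(1, n_nodes - 1):
            nxt[i] = cur[i + 1] + cur[i - 1] - prev[i]
        history.append(nxt)
        prev, cur = cur, nxt
    return history

history = propagate_unit_error()
largest = max(abs(v) for row in history for v in row)

for row in history[:12]:
    print(" ".join(f"{v:3.0f}" for v in row))
print("largest error magnitude over all steps:", largest)
```

Because the update only adds and subtracts, the entries stay whole numbers, and the largest magnitude remaining at 1 is the bounded behavior that the stability argument relies on.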


9.4. METHOD OF CHARACTERISTICS 


The properties of the solution to the wave equation are further elucidated by considering 
the “characteristic curves” of the equation. This will also permit us to extend our numerical 
method to more general hyperbolic equations. 

Consider the second-order partial-differential equation in two variables x and t:

$$a u_{xx} + b u_{xt} + c u_{tt} + e = 0. \qquad (9.14)$$


Here we have used the subscript notation to represent partial derivatives. The coefficients
a, b, c, and e may be functions of x, t, $u_x$, $u_t$, and u, so the equation is very general.*
We take $u_{xt} = u_{tx}$. To facilitate manipulations, let

$$p = \frac{\partial u}{\partial x} = u_x, \qquad q = \frac{\partial u}{\partial t} = u_t.$$

Write out the differentials of p and q:

$$dp = \frac{\partial p}{\partial x}\,dx + \frac{\partial p}{\partial t}\,dt = u_{xx}\,dx + u_{xt}\,dt,$$

$$dq = \frac{\partial q}{\partial x}\,dx + \frac{\partial q}{\partial t}\,dt = u_{xt}\,dx + u_{tt}\,dt.$$

Solving these last equations for $u_{xx}$ and $u_{tt}$, respectively, we have

$$u_{xx} = \frac{dp - u_{xt}\,dt}{dx} = \frac{dp}{dx} - u_{xt}\frac{dt}{dx},$$

$$u_{tt} = \frac{dq - u_{xt}\,dx}{dt} = \frac{dq}{dt} - u_{xt}\frac{dx}{dt}.$$


Substituting in Eq. (9.14) and rearranging, we obtain

$$-a u_{xt}\frac{dt}{dx} + b u_{xt} - c u_{xt}\frac{dx}{dt} + a\frac{dp}{dx} + c\frac{dq}{dt} + e = 0.$$


*When the coefficients are independent of the function u or its derivatives, it is linear. If they are functions of
u, $u_x$, or $u_t$ (but not $u_{xx}$ or $u_{tt}$), it is called quasilinear.
Now multiplying by $-dt/dx$, we get

$$u_{xt}\left[a\left(\frac{dt}{dx}\right)^2 - b\left(\frac{dt}{dx}\right) + c\right] - \left[a\frac{dp}{dx}\frac{dt}{dx} + c\frac{dq}{dx} + e\frac{dt}{dx}\right] = 0.$$


Suppose, in the xt-plane, we define curves such that the first bracketed expression is zero.
On such curves, the original differential equation is equivalent to setting the second
bracketed expression equal to zero. That is, if

$$am^2 - bm + c = 0, \qquad (9.15)$$

where $m = dt/dx$, then the solution to the original equation (Eq. (9.14)) can be found
by solving

$$am\,dp + c\,dq + e\,dt = 0. \qquad (9.16)$$


We have elected to write Eq. (9.16) in the form of differentials. It will be seen that we
reduce the original problem, which is a second-order partial-differential equation, to
solving a pair of first-order equations of the form of Eq. (9.16).

The curves whose slope m is given by Eq. (9.15) are called the characteristics of
the differential equation. Since the equation is a quadratic, it may have one, two, or no
real solutions, depending on the value of $b^2 - 4ac$. The value of this discriminant is the
usual basis for classifying partial-differential equations. If

$$b^2 - 4ac < 0,$$

the equation is called elliptic, and there are no (real) characteristics. If $b^2 - 4ac = 0$,
there is a single characteristic at any point, and the equation is termed parabolic. When
$b^2 - 4ac > 0$, at every point there will be a pair of characteristic curves whose slopes
are given by the two distinct, real roots of Eq. (9.15). Such equations are called hyperbolic,
and our present discussion considers only this type.
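
The classification rule can be stated as a small routine. The sketch below is illustrative only: the names `classify` and `characteristic_slopes` are hypothetical, the coefficients are taken at a single point, and `a` is assumed to be nonzero so that Eq. (9.15) is a genuine quadratic.

```python
import math

def classify(a, b, c):
    """Classify a*u_xx + b*u_xt + c*u_tt + e = 0 at a point by b^2 - 4ac."""
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return "elliptic"
    if disc == 0.0:          # exact test; with computed coefficients use a tolerance
        return "parabolic"
    return "hyperbolic"

def characteristic_slopes(a, b, c):
    """Real roots m = dt/dx of a*m^2 - b*m + c = 0, Eq. (9.15); assumes a != 0."""
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return ()            # elliptic: no real characteristics
    root = math.sqrt(disc)
    return tuple(sorted(((b - root) / (2.0 * a), (b + root) / (2.0 * a))))

# Coefficients a = -2, b = 0, c = 1 give the slopes +/- sqrt(2)/2.
print(classify(-2.0, 0.0, 1.0), characteristic_slopes(-2.0, 0.0, 1.0))
print(classify(1.0, 0.0, 1.0), characteristic_slopes(1.0, 0.0, 1.0))
print(classify(1.0, 0.0, 0.0), characteristic_slopes(1.0, 0.0, 0.0))
```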

Along the characteristics, the solution has special and desirable properties. For exam- 
ple, discontinuities in the initial conditions are propagated along them. On the charac- 
teristic curves, the numerical solution can be developed in the general case also. 
Figure 9.3


We will now outline a method of solving equations of the form of Eq. (9.14) by
numerical integration along the characteristics.* We visualize the initial conditions as
specifying the function u on some curve in the tx-plane,† as well as its normal derivative.
Consider two points, P and Q, on this initial curve (Fig. 9.3). When Eq. (9.14) is
hyperbolic, there are two characteristic curves through each point. The rightmost curve
through P intersects the leftmost curve through Q at a point R, and these curves are such that their
slopes are given by the appropriate roots of Eq. (9.15). Call $m_+$ the values of the slope
on curve PR, and $m_-$ the values on curve QR.

Since these curves are characteristics, the solution to the problem can be found by
solving Eq. (9.16) along them.

Our procedure will be to first find point R (perhaps only as a first approximation if
a, b, or c involve the unknown function u). This is done by solving the equation

$$\Delta t = m_{av}\,\Delta x \qquad (9.17)$$

applied over the arcs PR and QR simultaneously. When dt/dx is a function of x and/or
t, it may be possible to integrate Eq. (9.17) analytically. When dt/dx varies with u, we
will use a procedure resembling the Euler predictor-corrector method, by predicting with
$m_{av}$ taken as equal to $m_+$ at P or $m_-$ at Q to start the solution. We correct by using the
arithmetic average of m at the endpoints of each arc as soon as the value of m at R can
be evaluated.

We then integrate Eq. (9.16) in the form

$$a\,m_{av}\,\Delta p + c\,\Delta q + e\,\Delta t = 0,$$

starting first from P and then from Q, using the appropriate values of m for each. This
will estimate p and q at point R. Finally we evaluate the function u at R from

$$du = \frac{\partial u}{\partial x}\,dx + \frac{\partial u}{\partial t}\,dt,$$

used in the form

$$\Delta u = p_{av}\,\Delta x + q_{av}\,\Delta t.$$


The equation for $\Delta u$ can be applied to either the change along P → R or Q → R;
when these do not give the same result, we will average the two values. It may be
necessary to iterate this procedure to get improved values at R. Average values are used
for all varying quantities. The calculations are repeated for a second point S that is the
intersection of characteristics through Q and another initial point W, and then continued
in a like manner to evaluate u throughout the region in the xt-plane as desired. We
illustrate with three examples, first with dt/dx equal to a constant, then with dt/dx varying
with x, and finally a more complex example with dt/dx varying with u.


*We discuss only the solution of hyperbolic differential equations by the method of characteristics, but the
technique can also be applied to parabolic equations.
†This curve must not itself be one of the characteristics, or advancing the solution is impossible.
For the simple wave equation,

$$u_{tt} = c^2 u_{xx},$$

the slopes of the characteristics are

$$m = \pm\frac{1}{c},$$

and the characteristics are the lines

$$t = \pm\frac{1}{c}(x - x_0).$$


Consider the curves from points P and Q, taken on the line for t = 0 (Fig. 9.4). The
network of points used in the explicit finite-difference method of Section 9.1 is seen to
be the intersections of characteristics through pairs of points spaced $2\Delta x$ apart. The finite-
difference method, with

$$\frac{Tg(\Delta t)^2}{w(\Delta x)^2} = 1,$$

will be found to be the equivalent of integration along the characteristics, lending further
support to the likelihood that it will give exact answers.
Figure 9.4


EXAMPLE 1  Solve

$$\frac{\partial^2 u}{\partial t^2} = 2\,\frac{\partial^2 u}{\partial x^2} - 4,$$

with initial conditions

$$u = 12x \quad \text{for } 0 \le x \le 0.25,$$
$$u = 4 - 4x \quad \text{for } 0.25 \le x \le 1.0,$$
$$\frac{\partial u}{\partial t} = 0 \quad \text{for } 0 \le x \le 1.0;$$

boundary conditions are u = 0 at x = 0 and at x = 1.0.
Putting the equation into the standard form,

$$a\frac{\partial^2 u}{\partial x^2} + b\frac{\partial^2 u}{\partial x\,\partial t} + c\frac{\partial^2 u}{\partial t^2} + e = 0,$$

gives a = −2, b = 0, c = 1, and e = 4. (The equation is linear since a, b, c, and e
are independent of u, $u_x$, and $u_t$.)


Figure 9.5 (characteristic lines in the xt-plane; the x-axis runs from 0 to 1.00)
The slopes of the characteristics are the roots of

$$-2m^2 + 1 = 0, \qquad m = \pm\frac{\sqrt{2}}{2},$$

so the characteristic curves are straight lines in the xt-plane, as shown in Fig. 9.5. Consider
points P, Q, and R, at (0.25, 0), (0.75, 0), and (0.5, 0.1768), and solve Eq. (9.16),
which is

$$a\,m_{av}\,\Delta p + c\,\Delta q + e\,\Delta t = 0.$$

Along P → R:

$$-2\left(\frac{\sqrt{2}}{2}\right)\Delta p + \Delta q + 4\,\Delta t = 0,$$

$$-\sqrt{2}\,(p_R - p_P) + (q_R - q_P) + 4(0.1768) = 0;$$

Along Q → R:

$$-2\left(-\frac{\sqrt{2}}{2}\right)\Delta p + \Delta q + 4\,\Delta t = 0,$$

$$\sqrt{2}\,(p_R - p_Q) + (q_R - q_Q) + 4(0.1768) = 0.$$

Using

$$p_P = \left(\frac{\partial u}{\partial x}\right)_P = -4,^{*} \qquad p_Q = \left(\frac{\partial u}{\partial x}\right)_Q = -4, \qquad q_P = \left(\frac{\partial u}{\partial t}\right)_P = 0, \qquad q_Q = \left(\frac{\partial u}{\partial t}\right)_Q = 0,$$

we find $p_R = -4$, $q_R = -\sqrt{2}/2$ by solving the equations for P → R and Q → R
simultaneously.
Now we evaluate u at point R through its change along P → R:

$$\Delta u = p_{av}\,\Delta x + q_{av}\,\Delta t = -4(0.25) + \left(\frac{0 - \sqrt{2}/2}{2}\right)(0.1768) = -1.0625,$$

$$u_R = 3 + (-1.0625) = 1.9375.$$


(If we compute through evaluating $\Delta u$ along Q → R, we get the same result.)


*The gradient has a discontinuity at x = 0.25. The value of ∂u/∂x for points to the right of P applies for the
region PRQ.
Table 9.4

   x                   0.0      0.25      0.50      0.75
   u(t = 0)            0.0     3.0       2.0       1.0
   u(t = 0.1768)       0.0     0.9375    1.9375    0.9375
   u(t = 0.3535)       0.0    -1.1875   -0.2500    0.8125
   u(t = 0.5303)       0.0    -1.3125   -2.4375   -1.3125


For this simple problem, the finite-difference method is much simpler, and we expect
it to give the same results. Following the procedure of Section 9.1* we compute with
$\Delta x = 0.25$, $\Delta t = \Delta x/\sqrt{2} = 0.1768$, and obtain Table 9.4. The circled value, 1.9375 at
x = 0.50 and t = 0.1768, agrees exactly with that calculated by the method of characteristics.
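
Both routes to the value at R can be checked together. The sketch below is an illustration with hypothetical variable names: it builds the finite-difference rows from the algorithm quoted in the footnote to this example, $u_i^{j+1} = (u_{i+1}^j + u_{i-1}^j) - u_i^{j-1} - 4(\Delta t)^2$, and then solves the two characteristic relations for $p_R$ and $q_R$ by adding and subtracting them.

```python
import math

# Finite-difference rows, dx = 0.25 and dt = dx / sqrt(2).
dx = 0.25
dt = dx / math.sqrt(2.0)
n = 4
xs = [i * dx for i in range(n + 1)]
u0 = [12.0 * x if x <= 0.25 else 4.0 - 4.0 * x for x in xs]
source = 4.0 * dt * dt                         # the term 4 (dt)^2
# First step with zero initial velocity uses half of the source term.
u1 = [0.0] + [0.5 * (u0[i + 1] + u0[i - 1]) - 0.5 * source
              for i in range(1, n)] + [0.0]
table = [u0, u1]
for _ in range(2):
    prev, cur = table[-2], table[-1]
    table.append([0.0] + [cur[i + 1] + cur[i - 1] - prev[i] - source
                          for i in range(1, n)] + [0.0])

# Method of characteristics for R = (0.5, dt) from P = (0.25, 0), Q = (0.75, 0):
#   -sqrt(2) (pR - pP) + (qR - qP) + 4 dt = 0    along P -> R
#    sqrt(2) (pR - pQ) + (qR - qQ) + 4 dt = 0    along Q -> R
uP, pP, pQ, qP, qQ = 3.0, -4.0, -4.0, 0.0, 0.0
r2 = math.sqrt(2.0)
qR = 0.5 * (qP + qQ) - 4.0 * dt + 0.5 * r2 * (pQ - pP)    # sum of the two relations
pR = 0.5 * (pP + pQ) + (qQ - qP) / (2.0 * r2)             # their difference
uR = uP + 0.5 * (pP + pR) * 0.25 + 0.5 * (qP + qR) * dt   # change along P -> R

for t_index, row in enumerate(table):
    print(f"t = {t_index * dt:.4f}", " ".join(f"{v:8.4f}" for v in row))
print(f"characteristics: pR = {pR:.4f}, qR = {qR:.4f}, uR = {uR:.4f}")
```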


EXAMPLE 2  Solve

$$\frac{\partial^2 u}{\partial t^2} = (1 + 2x)\,\frac{\partial^2 u}{\partial x^2}$$

over (0, 1) with fixed boundaries and the initial conditions:

$$u(x, 0) = 0, \qquad \frac{\partial u}{\partial t}(x, 0) = x(1 - x).$$


For this problem, a = −(1 + 2x), b = 0, c = 1, e = 0. Then $am^2 - bm + c = 0$ gives

$$m = \pm\sqrt{\frac{1}{1 + 2x}}.$$


The characteristic curves are found by solving the differential equations $dt/dx =
\sqrt{1/(1 + 2x)}$ and $dt/dx = -\sqrt{1/(1 + 2x)}$. Integrating† from the initial point $x_0$ and $t_0$,
we have

$$t = t_0 + \sqrt{1 + 2x} - \sqrt{1 + 2x_0} \quad \text{from } m_+,$$

$$t = t_0 - \sqrt{1 + 2x} + \sqrt{1 + 2x_0} \quad \text{from } m_-.$$


Figure 9.6 shows several of the characteristic curves. We select two points on the initial
curve for t = 0, at P = (0.25, 0) and Q = (0.75, 0), whose characteristics intersect at
point R. Solving for the intersection, we find R = (0.4841, 0.1782).


*The algorithm is $u_i^{j+1} = (u_{i+1}^{j} + u_{i-1}^{j}) - u_i^{j-1} - 4(\Delta t)^2$ with $\Delta t = \Delta x/\sqrt{2}$. For the first time step, $u_i^{1} =
\tfrac{1}{2}(u_{i+1}^{0} + u_{i-1}^{0}) - \tfrac{1}{2}(4)(\Delta t)^2$.

†In this example, the integration methods of calculus are easy to use. We could use a numerical method if they
were not.


Figure 9.6 (characteristic curves for Example 2; one labeled intersection reads (0.3328, 0.2906), u = 0.0363)


We now solve Eq. (9.16) to obtain p = ∂u/∂x and q = ∂u/∂t at R:

At point P:  x = 0.25, t = 0, u = 0, p = 0, q = x − x² = 0.1875,
             m = √(1/(1 + 2x)) = 0.8165,
             a = −(1 + 2x) = −1.5, b = 0, c = 1, e = 0.

At point Q:  x = 0.75, t = 0, u = 0, p = 0, q = x − x² = 0.1875,
             m = −√(1/(1 + 2x)) = −0.6325,
             a = −2.5, b = 0, c = 1, e = 0.

At point R:  x = 0.4841, t = 0.1783,
             m₊ = √(1/(1 + 2x)) = 0.7128,
             m₋ = −√(1/(1 + 2x)) = −0.7128,
             a = −1.9682, b = 0, c = 1, e = 0.


Equation (9.16) becomes, when we use average values for a and m,

P → R:  −1.7341(0.7646)(p_R − 0) + (1)(q_R − 0.1875) = 0;
Q → R:  −2.2341(−0.6726)(p_R − 0) + (1)(q_R − 0.1875) = 0.

Solving simultaneously, we get p_R = 0, q_R = 0.1875.
We calculate the change in u along the characteristics:

P → R:  Δu = 0(0.2341) + 0.1875(0.1783) = 0.0334,
Q → R:  Δu = 0(−0.2659) + 0.1875(0.1783) = 0.0334,
u_R = 0 + 0.0334 = 0.0334.


Figure 9.6 gives the results at several other intersections of characteristics. Students
should verify these results to be sure they understand the method of characteristics.
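
As one way to carry out such a verification, the sketch below recomputes point R and the value of u there. It is an illustration only, with hypothetical variable names; it uses the integrated characteristic equations to locate R and then applies Eq. (9.16) with average values of a and m on each arc, as in the text.

```python
import math

def slope(x):
    """Magnitude of the characteristic slope, sqrt(1 / (1 + 2x))."""
    return 1.0 / math.sqrt(1.0 + 2.0 * x)

def a_coef(x):
    return -(1.0 + 2.0 * x)

xP, xQ = 0.25, 0.75
pP = pQ = 0.0                                   # u(x, 0) = 0, so du/dx = 0
qP, qQ = xP * (1.0 - xP), xQ * (1.0 - xQ)       # du/dt = x(1 - x)

# Intersection of t =  sqrt(1 + 2x) - sqrt(1 + 2 xP)   (from P, slope m+)
#            with t = -sqrt(1 + 2x) + sqrt(1 + 2 xQ)   (from Q, slope m-)
root = 0.5 * (math.sqrt(1.0 + 2.0 * xP) + math.sqrt(1.0 + 2.0 * xQ))
xR = 0.5 * (root * root - 1.0)
tR = root - math.sqrt(1.0 + 2.0 * xP)

# Eq. (9.16) with c = 1, e = 0 and averaged a*m on each arc:
#   A1 (pR - pP) + (qR - qP) = 0,   A2 (pR - pQ) + (qR - qQ) = 0
A1 = 0.5 * (a_coef(xP) + a_coef(xR)) * 0.5 * (slope(xP) + slope(xR))
A2 = 0.5 * (a_coef(xQ) + a_coef(xR)) * 0.5 * (-slope(xQ) - slope(xR))
pR = ((A1 * pP + qP) - (A2 * pQ + qQ)) / (A1 - A2)
qR = A1 * pP + qP - A1 * pR

# Change in u along each arc, using average p and q.
u_from_P = 0.0 + 0.5 * (pP + pR) * (xR - xP) + 0.5 * (qP + qR) * tR
u_from_Q = 0.0 + 0.5 * (pQ + pR) * (xR - xQ) + 0.5 * (qQ + qR) * tR
uR = 0.5 * (u_from_P + u_from_Q)

print(f"R = ({xR:.4f}, {tR:.4f}), pR = {pR:.4f}, qR = {qR:.4f}, uR = {uR:.4f}")
```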


EXAMPLE 3  Solve the quasilinear equation, with conditions as shown, by numerical integration along
the characteristics. (This might be a vibrating string with tension related to the
displacement u and subject to an external lateral force.)

$$\frac{\partial^2 u}{\partial x^2} - u\,\frac{\partial^2 u}{\partial t^2} + (1 - x^2) = 0, \qquad u(x, 0) = x(1 - x), \qquad u_t(x, 0) = 0,$$

$$u(0, t) = 0, \qquad u(1, t) = 0. \qquad (9.18)$$


We will advance the solution beyond the start from P, at x = 0.2, t = 0, and Q, at
x = 0.4, t = 0, to one new point R. Comparing Eq. (9.18) to the standard form,

$$a u_{xx} + b u_{xt} + c u_{tt} + e = 0,$$

we have a = 1, b = 0, c = −u, e = 1 − x². We first compute u, p, and q at points P
and Q:

$$u = x(1 - x)$$

(from the initial conditions), so

$$u_P = 0.2(1 - 0.2) = 0.16,$$
$$u_Q = 0.4(1 - 0.4) = 0.24.$$

Also,

$$p = \frac{\partial u}{\partial x} = 1 - 2x$$

(by differentiating the initial conditions), so

$$p_P = 1 - 2(0.2) = 0.6,$$
$$p_Q = 1 - 2(0.4) = 0.2;$$

and

$$q = \frac{\partial u}{\partial t} = 0$$

(from the initial conditions), so

$$q_P = 0, \qquad q_Q = 0.$$


To locate point R, we need the slope m of the characteristic. Using $am^2 - bm + c = 0$, we get

$$m = \frac{b \pm \sqrt{b^2 - 4ac}}{2a} = \pm\sqrt{u}.$$


Since m depends on the solution u, we will need to find point R through the predictor-
corrector approach. In the first trial, use the initial values over the whole arc; that is,
take $m_+ = +m_P$ and $m_- = -m_Q$:

$$m_+ = \sqrt{u_P} = \sqrt{0.16} = 0.4,$$
$$m_- = -\sqrt{u_Q} = -\sqrt{0.24} = -0.490.$$

We now estimate the coordinates of R by solving simultaneously

$$t_R = m_+(x_R - x_P) = 0.4(x_R - 0.2),$$
$$t_R = m_-(x_R - x_Q) = -0.490(x_R - 0.4).$$

These give

$$x_R = 0.310, \qquad t_R = 0.044.$$


We write Eq. (9.16) along each characteristic, again using the initial values of m, since
m at R is still unknown:

$$am\,\Delta p + c\,\Delta q + e\,\Delta t = 0,$$

$$(1)(0.4)(p_R - 0.6) + (-0.16)(q_R - 0) + \left(1 - \frac{0.04 + 0.096}{2}\right)(0.044) = 0,$$

$$(1)(-0.490)(p_R - 0.2) + (-0.24)(q_R - 0) + \left(1 - \frac{0.16 + 0.096}{2}\right)(0.044) = 0.$$


In these equations we used the arithmetic average of x² in the last terms. Solving
simultaneously, we get

$$p_R = 0.399, \qquad q_R = -0.246.$$


As a first approximation for u at R, then,

$$\Delta u = p\,\Delta x + q\,\Delta t,$$

$$u_R - 0.16 = \frac{0.6 + 0.399}{2}(0.310 - 0.2) + \frac{0 - 0.246}{2}(0.044 - 0),$$

$$u_R = 0.2095.$$


The last computation was along PR, using average values of p and q. We could have
alternatively proceeded along QR. If this is done,

$$u_R - 0.24 = \frac{0.2 + 0.399}{2}(0.310 - 0.4) + \frac{0 - 0.246}{2}(0.044 - 0),$$

$$u_R = 0.2076.$$


The two values should be close to each other. Let us use the average value, 0.2086,
as our initial estimate of $u_R$. We now repeat the work. In getting the coordinates of R,
we now use average values of the slopes,

$$t_R = \frac{0.4 + \sqrt{0.2086}}{2}(x_R - 0.2),$$

$$t_R = \frac{-0.490 - \sqrt{0.2086}}{2}(x_R - 0.4),$$

$$x_R = 0.305, \qquad t_R = 0.045;$$

$$(1)\left(\frac{0.4 + \sqrt{0.2086}}{2}\right)(p_R - 0.6) - \left(\frac{0.16 + 0.2086}{2}\right)(q_R - 0) + \left(1 - \frac{0.04 + 0.0930}{2}\right)(0.045) = 0,$$

$$(1)\left(\frac{-0.490 - \sqrt{0.2086}}{2}\right)(p_R - 0.2) - \left(\frac{0.24 + 0.2086}{2}\right)(q_R - 0) + \left(1 - \frac{0.16 + 0.0930}{2}\right)(0.045) = 0;$$

$$p_R = 0.398, \qquad q_R = -0.242;$$

$$u_R = 0.16 + \frac{0.6 + 0.398}{2}(0.305 - 0.2) + \frac{0 - 0.242}{2}(0.045 - 0),$$

$$u_R = 0.2071 \quad \text{(along PR)};$$

$$u_R = 0.24 + \frac{0.2 + 0.398}{2}(0.305 - 0.4) + \frac{0 - 0.242}{2}(0.045 - 0),$$

$$u_R = 0.2063 \quad \text{(along QR)}.$$


The average value is 0.2067.
Another round of calculations gives $u_R = 0.2066$, which checks the previous value
sufficiently. This method is, of course, very tedious by hand.
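
Since the hand computation is tedious, the iteration is a natural candidate for a short program. The sketch below is an illustration, not the book's code: the helper `solve_2x2` and all variable names are hypothetical, and it carries full precision, so intermediate values differ from the hand-rounded figures above in the last digit or so. The first pass uses the values at P and Q over each whole arc; later passes use averages with the current estimate at R.

```python
import math

def solve_2x2(a11, a12, b1, a21, a22, b2):
    """Solve a11*x + a12*y = b1, a21*x + a22*y = b2 by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det

xP, xQ = 0.2, 0.4
uP, uQ = xP * (1.0 - xP), xQ * (1.0 - xQ)       # u(x, 0) = x(1 - x)
pP, pQ = 1.0 - 2.0 * xP, 1.0 - 2.0 * xQ         # p = du/dx at t = 0
qP = qQ = 0.0                                   # q = du/dt at t = 0

# Predictor: u at R is unknown, so each arc starts with its own end value.
uR_on_PR, uR_on_QR = uP, uQ
for iteration in range(10):
    m_plus = 0.5 * (math.sqrt(uP) + math.sqrt(uR_on_PR))      # average slope on PR
    m_minus = -0.5 * (math.sqrt(uQ) + math.sqrt(uR_on_QR))    # average slope on QR
    # Locate R from tR = m_plus (xR - xP) = m_minus (xR - xQ).
    xR = (m_plus * xP - m_minus * xQ) / (m_plus - m_minus)
    tR = m_plus * (xR - xP)
    # Eq. (9.16) with a = 1, c = -u, e = 1 - x^2, using averages on each arc.
    c_PR = -0.5 * (uP + uR_on_PR)
    c_QR = -0.5 * (uQ + uR_on_QR)
    e_PR = 1.0 - 0.5 * (xP ** 2 + xR ** 2)
    e_QR = 1.0 - 0.5 * (xQ ** 2 + xR ** 2)
    pR, qR = solve_2x2(
        m_plus, c_PR, m_plus * pP + c_PR * qP - e_PR * tR,
        m_minus, c_QR, m_minus * pQ + c_QR * qQ - e_QR * tR,
    )
    # u at R from each arc, then averaged.
    u_PR = uP + 0.5 * (pP + pR) * (xR - xP) + 0.5 * (qP + qR) * tR
    u_QR = uQ + 0.5 * (pQ + pR) * (xR - xQ) + 0.5 * (qQ + qR) * tR
    uR = 0.5 * (u_PR + u_QR)
    uR_on_PR = uR_on_QR = uR                    # corrector uses this estimate
    print(f"pass {iteration + 1}: xR = {xR:.4f}, tR = {tR:.4f}, "
          f"pR = {pR:.4f}, qR = {qR:.4f}, uR = {uR:.4f}")
```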


9.5 THE WAVE EQUATION IN TWO SPACE DIMENSIONS


The finite-difference method can be applied to hyperbolic partial-differential equations in
two or more space dimensions. A typical problem is the vibrating membrane. Consider
a thin flexible membrane stretched over a rectangular frame and set to vibrating. A
development analogous to that for the vibrating string gives

$$\frac{\partial^2 u}{\partial t^2} = \frac{Tg}{w}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right),$$

in which u is the displacement, t is the time, x and y are the space coordinates, T is the
uniform tension per unit length, g is the acceleration of gravity, and w is the weight per
unit area. For simplification, let $Tg/w = \alpha^2$. Replacing each derivative by its central-
difference approximation, and using $h = \Delta x = \Delta y$, gives (we recognize the Laplacian
on the right-hand side)

$$\frac{u_{i,j}^{k+1} - 2u_{i,j}^{k} + u_{i,j}^{k-1}}{(\Delta t)^2} = \alpha^2\,\frac{u_{i+1,j}^{k} + u_{i-1,j}^{k} + u_{i,j+1}^{k} + u_{i,j-1}^{k} - 4u_{i,j}^{k}}{h^2}. \qquad (9.19)$$

Solving for the displacement at time $t_{k+1}$, we obtain

$$u_{i,j}^{k+1} = \frac{\alpha^2(\Delta t)^2}{h^2}\left(u_{i+1,j}^{k} + u_{i-1,j}^{k} + u_{i,j+1}^{k} + u_{i,j-1}^{k}\right) - u_{i,j}^{k-1} + \left(2 - \frac{4\alpha^2(\Delta t)^2}{h^2}\right)u_{i,j}^{k}. \qquad (9.20)$$


In Eqs. (9.19) and (9.20), we use superscripts to denote the time. If we let
$\alpha^2(\Delta t)^2/h^2 = \tfrac{1}{2}$, the last term vanishes and we get

$$u_{i,j}^{k+1} = \tfrac{1}{2}\left(u_{i+1,j}^{k} + u_{i-1,j}^{k} + u_{i,j+1}^{k} + u_{i,j-1}^{k}\right) - u_{i,j}^{k-1}. \qquad (9.21)$$


For the first time step, we get displacements from Eq. (9.22), which is obtained by
approximating ∂u/∂t at t = 0 by a central-difference approximation involving $u_{i,j}^{1}$ and
$u_{i,j}^{-1}$:

$$u_{i,j}^{1} = \tfrac{1}{4}\left(u_{i+1,j}^{0} + u_{i-1,j}^{0} + u_{i,j+1}^{0} + u_{i,j-1}^{0}\right) + (\Delta t)\,g(x_i, y_j). \qquad (9.22)$$


In Eq. (9.22), g(x, y) is the initial velocity.
It should not surprise us to learn that this ratio $\alpha^2(\Delta t)^2/h^2 = \tfrac{1}{2}$ is the maximum value
for stability, in view of our previous experience with explicit methods. In
contrast with the wave equation in one space dimension, we do not get exact answers
from the numerical procedure of Eq. (9.21), and we further observe that we must use
smaller time steps in relation to the size of the space interval. Therefore we advance in
time more slowly. However, the numerical method is straightforward, as the following
example will show.


EXAMPLE  A membrane for which $\alpha^2 = Tg/w = 3$ is stretched over a square frame that occupies
the region 0 ≤ x ≤ 2, 0 ≤ y ≤ 2, in the xy-plane. It is given an initial displacement
described by

$$u = x(2 - x)\,y(2 - y),$$

and has an initial velocity of zero. Find how the displacement varies with time.

We divide the region with $h = \Delta x = \Delta y = \tfrac{1}{2}$, obtaining nine interior
points. Initial displacements are calculated from the initial conditions: $u(x, y) =
x(2 - x)\,y(2 - y)$. $\Delta t$ is taken at its maximum value for stability, $h/(\sqrt{2}\,\alpha) = 0.2041$.
The values at the end of one time step are given by

$$u_{i,j}^{1} = \tfrac{1}{4}\left(u_{i+1,j}^{0} + u_{i-1,j}^{0} + u_{i,j+1}^{0} + u_{i,j-1}^{0}\right)$$

because g(x, y) in Eq. (9.22) is everywhere zero. For succeeding time steps, Eq. (9.21)
is used. Table 9.5 gives the results of our calculations. Also shown in Table 9.5 (in
parentheses) are analytical values, computed from the double infinite series:


ne ae 


ux, v7) = 


in the average accuracy; to approach closely to the analytical results, $h = \Delta x = \Delta y$ must
be made smaller. When this is done, $\Delta t$ will need to decrease in proportion, requiring
many time steps and leading to many repetitions of the algorithm and extensive use
Table 9.5 Displacements of a vibrating membrane, finite-difference method: $\Delta t = h/(\sqrt{2}\,\alpha)$


Grid location > 
t } (0.5. 0.5) (1.0,0.5) (1.5.0.5) (0.5, 1.0) (1.0.1.0) (1.5.1.0) (0.5, 1-5) (1.0, 1.5) (1.5, 1.5) 


0 0.5625 0 0.5625 0.750 1.000 0.5625 0.750 0.5625 
(0.5625) = (0.750) 
24) (0.375 0.531 0.375 0.531 0.531 0.375 0.531 0.375 
} (0.380) (0.336) 
408 —-0.031 0.000 -0.031 0.000 0.000 -0.031 0.000 
(-0.044) 
612 | —0.375 -0.375 -0.531 -0.531 -0.375 -0.531 
(-0.352) 
816 | —0.500 —0.500 -0.750 -0.750 -0.500 -0.750 -0.500 
(-0.502) 
021 | —0.375 —0.375 -0.531 ~-0.531 -0,375 0.531 = -0.375 
(-0.407) (0.691) 
225 -0.031 -0.031 0.000 0.062 0.000 = -0.031 0.000 = -0.031 
(-0.015) (0.030) 
429 0.375 0.375 0.531 0.750 0.531 0.375 0.531 0.375 
(0.410) (0.688) 


of computer time. One remedy is the use of implicit methods, which allow the use of
larger ratios of $\alpha^2(\Delta t)^2/h^2$. However, with many nodes in the xy-grid, this requires large,
sparse matrices similar to the Crank-Nicolson method for parabolic equations in two
space dimensions. A.D.I. methods have been used for hyperbolic equations; tridiagonal
systems result. We do not discuss these methods.
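
The explicit membrane computation of the example can be sketched as follows. This is an illustration with hypothetical names (`initial`, `neighbor_sum`, `step`), using the data of the example: $\alpha^2 = 3$, $h = 0.5$, zero initial velocity, and fixed edges. The first level comes from Eq. (9.22) with g = 0 and later levels from Eq. (9.21).

```python
import math

alpha2, h = 3.0, 0.5
dt = h / math.sqrt(2.0 * alpha2)          # makes alpha^2 dt^2 / h^2 = 1/2
n = 4                                     # 4 intervals per side, 3 x 3 interior nodes

def initial(x, y):
    return x * (2.0 - x) * y * (2.0 - y)

u0 = [[initial(i * h, j * h) for j in range(n + 1)] for i in range(n + 1)]

def neighbor_sum(u, i, j):
    return u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]

def step(cur, prev=None):
    """One time step; the frame (boundary nodes) stays at zero displacement."""
    new = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n):
        for j in range(1, n):
            if prev is None:               # first step, Eq. (9.22) with g = 0
                new[i][j] = 0.25 * neighbor_sum(cur, i, j)
            else:                          # later steps, Eq. (9.21)
                new[i][j] = 0.5 * neighbor_sum(cur, i, j) - prev[i][j]
    return new

levels = [u0, step(u0)]
for _ in range(6):
    levels.append(step(levels[-1], levels[-2]))

print("    t     (0.5,0.5)  (1.0,0.5)  (1.0,1.0)")
for k, u in enumerate(levels):
    print(f"{k * dt:7.3f}  {u[1][1]:9.3f}  {u[2][1]:9.3f}  {u[2][2]:9.3f}")
```

By symmetry the nine interior nodes take only three distinct values (corner, edge midpoint, center), so printing those three columns is enough to compare with a full table.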


9.6 CHAPTER SUMMARY 


You can test your understanding of Chapter 9 by seeing if you can

1. Determine if a partial-differential equation falls in the hyperbolic class.

2. Set up equations to find the displacements of a vibrating string after one time step,
and, from these, determine the displacements after n time steps.

3. Explain the d'Alembert method for the wave equation and how this shows that the
finite-difference method can give exact answers.

4. Outline the argument that demonstrates stability for the finite-difference procedure.

5. Explain what characteristic curves are, find them for a typical problem, and compute
a few points on the characteristics.

6. Apply the finite-difference method to a simple hyperbolic partial-differential equation
with two space dimensions.
SELECTED READING FOR CHAPTER 9

Smith (1978).


9.7 COMPUTER PROGRAM 


Program 1 (Fig. 9.7) uses the easy algorithm of Section 9.1, but starts the solution by
use of Eq. (9.13). The integral of the initial velocity is computed by Simpson's 1/3 rule
with the interval from $x_{i-1}$ to $x_{i+1}$ subdivided into 20 panels. The initial displacement
F(x) and the initial velocity g(x) are each defined by function subprograms. The program
is tested by calculating the example problem of Section 9.2, in which a string initially
at rest is set vibrating by giving it an initial velocity of $g(x) = 3 \sin(\pi x/L)$. Input values
are L = 9, $\Delta x = 1$, $a = \sqrt{Tg/w} = 2$, so that nine time steps are required here to reach
the time t = aL/4.


      PROGRAM WAVE (INPUT, OUTPUT)
C
C     ---------------------------------------------------------------
C
C     THIS PROGRAM COMPUTES SOLUTIONS FOR THE ONE DIMENSIONAL
C     WAVE EQUATION. THE INITIAL DISPLACEMENTS OF THE VIBRATING
C     STRING, FROM X=0 TO X=XLEN ARE GIVEN BY F(X), THE INITIAL
C     VELOCITIES ARE GIVEN BY G(X). F(X) AND G(X) ARE DEFINED BY
C     FUNCTION SUBPROGRAMS. THE END POINTS ARE ASSUMED FIXED AT
C     ZERO DISPLACEMENT.
C
C     THE RELATION USED IS U(I,J+1) = U(I+1,J) + U(I-1,J) - U(I,J-1)
C     EXCEPT FOR THE FIRST TIME STEP, WHEN THE VALUE IS GIVEN BY
C     U(I,1) = 0.5*( U(I-1,0) + U(I+1,0) ) + 0.5/C*INTEGRAL OF
C     INIT VEL.
C
C     ---------------------------------------------------------------
C
C     PARAMETERS ARE :
C
C     X     - DISTANCE ALONG THE STRING
C     DX    - INCREMENTS OF DISTANCE
C     XLEN  - TOTAL LENGTH OF STRING
C     N     - NUMBER OF SUBDIVISIONS
C     T     - TIME
C     TLAST - FINAL VALUE OF TIME FOR WHICH SOLUTION IS DESIRED
C     F(X)  - INITIAL DISPLACEMENTS
C     G(X)  - INITIAL VELOCITIES
C     TDM   - VALUE OF TENSION / MASS = C SQUARED
C     U     - DISPLACEMENTS AT EVEN TIME INTERVALS
C     V     - DISPLACEMENTS AT ODD TIME INTERVALS
C
C     ---------------------------------------------------------------
C
      REAL U(100),V(100),X,DX,XLEN,T,TLAST,F,G,TDM,SUBDX,XSUB,PI
      INTEGER N,NP1,I,J
      COMMON XLEN,PI
C
C     DEFINE SOME DATA STATEMENTS FOR INITIAL VALUES
C
      DATA X,N,T/0.0,9,0.0/
      DATA TDM,TLAST/4.0,4.5/
      PI = 4.0*ATAN(1.0)
      XLEN = 9.0
C
C     GET INITIAL DISPLACEMENTS
C
      NP1 = N + 1
      DX = XLEN / FLOAT(N)
      U(1) = 0.0
      U(NP1) = 0.0
      V(1) = 0.0
      V(NP1) = 0.0
      DO 10 I = 2,N
         X = X + DX
         U(I) = F(X)
   10 CONTINUE
C
C     WRITE OUT THE INITIAL DISPLACEMENTS
C
      PRINT 200, ( U(I), I = 1,NP1/2)
C
C     NOW GET DISPLACEMENTS AFTER FIRST TIME STEP. USE SIMPSON'S RULE
C     TO PERFORM INTEGRATIONS, USING 20 INTERVALS BETWEEN X(N-1) AND
C     X(N+1). WE DO THIS FOR EACH INTERMEDIATE X VALUE.
C
      SUBDX = DX / 10.0
      XSUB = DX
      DO 30 I = 2,N
         SUM = 0.0
         XSUB = XSUB - DX
         DO 20 J = 1,19,2
         SUM = SUM + G(XSUB) + 4.0*G(XSUB+SUBDX) + G(XSUB+2.0*SUBDX)
         XSUB = XSUB + 2.0*SUBDX
   20    CONTINUE
         V(I) = 0.5*( U(I-1) + U(I+1) ) + 0.5/SQRT(TDM)*SUBDX/3.0*SUM
   30 CONTINUE
C
C     WRITE OUT DISPLACEMENTS AFTER FIRST TIME STEP
C
      T = DX / SQRT(TDM)
      PRINT 201, T,( V(I), I = 1,NP1/2)
C
C     NOW COMPUTE UNTIL TLAST IS REACHED
C
   35 IF ( T .GE. TLAST ) STOP
      DO 40 I = 2,N
         U(I) = V(I-1) + V(I+1) - U(I)
   40 CONTINUE
      T = T + DX/SQRT(TDM)
      PRINT 201, T,( U(I), I = 1,NP1/2)
      DO 50 I = 2,N
         V(I) = U(I-1) + U(I+1) - V(I)
   50 CONTINUE
      T = T + DX/SQRT(TDM)
      PRINT 201, T,( V(I), I = 1,NP1/2)
      GO TO 35
C
  200 FORMAT(//' SOLUTION TO VIBRATING STRING PROBLEM ',///,
     +       ' INITIAL DISPLACEMENTS ARE '//(1X,11F9.4) )
  201 FORMAT(/' AT T = ',F5.2/(1X,11F9.4) )
      END
C
C     ---------------------------------------------------------------
C
      REAL FUNCTION F(X)
C
      REAL X
      F = 0.0
      RETURN
      END
C
C     ---------------------------------------------------------------
C
      REAL FUNCTION G(X)
C
      REAL X
      COMMON XLEN,PI
      G = 3.0*SIN(PI*X/XLEN)
      RETURN
      END

Figure 9.7 Program 1.

OUTPUT FOR PROGRAM 1

 SOLUTION TO VIBRATING STRING PROBLEM

 INITIAL DISPLACEMENTS ARE
     .0000    .0000    .0000    .0000    .0000

 AT T =   .50
     .0000    .5027    .9447   1.2728   1.4474

 AT T =  1.00
     .0000    .9447   1.7755   2.3921   2.7202

 AT T =  1.50
     .0000   1.2728   2.3921   3.2229   3.6649

 AT T =  2.00
     .0000   1.4474   2.7202   3.6649   4.1676

 AT T =  2.50
     .0000   1.4474   2.7202   3.6649   4.1676

 AT T =  3.00
     .0000   1.2728   2.3921   3.2229   3.6649

 AT T =  3.50
     .0000    .9447   1.7755   2.3921   2.7202

 AT T =  4.00
     .0000    .5027    .9447   1.2728   1.4474

 AT T =  4.50
     .0000    .0000    .0000    .0000    .0000
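As a cross-check on the listing and its output, the same procedure can be sketched in Python. This is only an illustration under stated assumptions, not part of the book's program: the names `simpson` and `wave` are hypothetical, the first time step uses the Simpson's 1/3 rule integral with 20 panels, and later steps use the simple difference relation quoted in the program comments.

```python
import math

def simpson(g, a, b, panels=20):
    """Composite Simpson's 1/3 rule on [a, b]; panels must be even."""
    h = (b - a) / panels
    total = g(a) + g(b)
    for k in range(1, panels):
        total += (4.0 if k % 2 else 2.0) * g(a + k * h)
    return total * h / 3.0

def wave(f, g, length, n, tdm, t_last):
    """Return a list of (t, displacements) for a string fixed at both ends."""
    dx = length / n
    c = math.sqrt(tdm)            # wave speed, C in the program comments
    dt = dx / c                   # time step that makes the ratio equal to 1
    x = [i * dx for i in range(n + 1)]
    prev = [0.0] + [f(x[i]) for i in range(1, n)] + [0.0]
    cur = [0.0] * (n + 1)
    # First step: average of neighbors plus the integral of the initial velocity.
    for i in range(1, n):
        cur[i] = 0.5 * (prev[i - 1] + prev[i + 1]) + 0.5 / c * simpson(g, x[i - 1], x[i + 1])
    history = [(0.0, prev), (dt, cur)]
    t = dt
    while t < t_last - 1e-9:
        nxt = [0.0] * (n + 1)
        for i in range(1, n):
            nxt[i] = cur[i - 1] + cur[i + 1] - prev[i]
        prev, cur = cur, nxt
        t += dt
        history.append((t, cur))
    return history

L = 9.0
steps = wave(lambda x: 0.0, lambda x: 3.0 * math.sin(math.pi * x / L), L, 9, 4.0, 4.5)
for t, u in steps:
    print(f"t = {t:4.2f}: " + " ".join(f"{v:7.4f}" for v in u[:5]))
```

Like the FORTRAN program, the sketch prints only the first five displacements at each time, since the solution is symmetric about the midpoint of the string.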


EXERCISES 


Section 9.1 


▶1. If the banjo string in the example of Section 9.1 is tightened, its frequency of vibration is
increased. Likewise, if the length of the vibrating portion of the string is shortened, as by
holding it against one of the frets with the finger, the pitch is raised. What would the frequency
be if the tension is made 42,000 g and the effective length is 70 cm? Determine the answer
by finding the number of Δt steps for the original displacement to be duplicated and compare
to


2. A vibrating string system has Tg/w = 4 cm²/sec². Divide the length L (L = 80 cm) into
intervals so that Δx = L/8 cm. Find the displacement for t = 0 up to t = L if both ends are
fixed, and the initial conditions are


a) $y = x(L - x)/L^2$, $\partial y/\partial t = 0$.
b) The string is displaced +1 units at L/4 and -1 units at 5L/8, $\partial y/\partial t = 0$.
I y=0, 2 = aoe (Use Eq. (9.3) to begin the solution.) 


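d) The string is displaced 1 unit at L/2, $\partial y/\partial t = -y$.
In part (a), compare to the analytical solution:

$$y = \frac{8}{\pi^3} \sum_{n=1}^{\infty} \frac{1}{(2n-1)^3} \sin\left[\frac{(2n-1)\pi x}{L}\right] \cos\left[\frac{(4n-2)\pi t}{L}\right].$$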
3. A function u satisfies the equation

$$\frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2}$$

with boundary conditions u = 0 at x = 0, u = 0 at x = 1, and with initial conditions

$$u = \sin \pi x, \quad \frac{\partial u}{\partial t} = 0 \quad \text{for } 0 \le x \le 1.$$

Solve by the finite-difference method and show that the results are the same as the analytical
solution:

$$u(x, t) = \sin \pi x \cos \pi t.$$

4. The ends of the vibrating string do not need to be fixed. Solve $\partial^2 y/\partial t^2 = \partial^2 y/\partial x^2$ with initial
conditions y = 0, $\partial y/\partial t = 0$, for 0 < x < 1, and end conditions

y = 0 at x = 0, y = sin (πt/4), $\partial y/\partial x = 0$ at x = 1.


Section 9.2 


5. Why can't we use Eq. (9.9) to solve the example in Section 9.1?
▶6. Since Eq. (9.4) is sometimes inaccurate when the initial velocity is not zero, the solutions
to parts (c) and (d) of Exercise 2 are not exact. Repeat these problems, getting more accurate
y-values by using Eq. (9.13) and evaluating the integral by Simpson's 1/3 rule.
7. A string that weighs w lb/ft is tightly stretched from x = 0 to x = L and is initially at rest.
Each point of the string is given initially a velocity of

$$\left.\frac{\partial y}{\partial t}\right|_{t=0} = v_0 \sin^3 \frac{\pi x}{L}.$$

The analytical solution to this problem is

$$y(x, t) = \frac{v_0 L}{12 \pi a}\left(9 \sin\frac{\pi x}{L} \sin\frac{\pi a t}{L} - \sin\frac{3\pi x}{L} \sin\frac{3\pi a t}{L}\right),$$

where $a = \sqrt{Tg/w}$, T being the tension and g the acceleration of gravity. When L = 3 ft,
w = 0.02 lb/ft, and T = 5 lb, with $v_0$ = 1 ft/sec, the analytical equation predicts y = 0.081
in at the midpoint when t = 0.01 sec. Solve the problem numerically to confirm this prediction.
Does your solution conform to the analytical solution at other values of x and t?


8. Demonstrate that if Tg(Δt)²/w(Δx)² > 1, the finite-difference method for the wave equation
will propagate an error with increasing effect by tracing the growth of the error similar to
that in Table 9.3. Take the simple case with ends fixed. Repeat for the ratio < 1.


▶9. Trace the magnitude of an isolated error in solving the wave equation with Tg(Δt)²/w(Δx)²
= 1, similar to Table 9.3, but for the case when the left-hand end is fixed and the right-hand
end moves according to


(1, 2) = sin = i 
aN me? 


0 at xT. 
Section 9.4 
10. For the partial-differential equation

$$a u_{xx} + b u_{xt} + c u_{tt} + e = 0,$$

sketch the characteristic curves through the point x = 0.5, t = 0 when:


a) a@=1,b=2,c=1 ba = 1, b= 20 = 3 
)a=1, b= 26=3 d) a=1,b=2,c=-1 
e) a=P,b=xt,¢ = -2? 


11. Verify the values at points S, T, U, and V in Fig. 9.6 for Example 2 of Section 9.4 by the
method of characteristics.


▶12. The equation

can be solved by the difference-equation method discussed in Sections 9.1 and 9.2. Solve it
by that method with initial conditions of

$$u(x, 0) = 0, \quad \frac{\partial u}{\partial t}(x, 0) = x(1 - x).$$

Compare these results with those from the method of characteristics in Fig. 9.6.
13. Get part of the solution by integrating along characteristics for the equation 


subject to initial conditions 


Find the solution at three points in the xt-plane, where the characteristics through (0.4, 0),
(0.5, 0), and (0.6, 0) intersect.
14. Solve the equation 
Uy + Xt, — Uy = 0, 
with initial conditions 
ou 
u = 2x, — = 0 
or 
Boundary conditions are u(O, 1) = 0, uw, 1) = 2. 
Find the solution at several points in the region of the x7-plane bounded by the lines x = 0.4, 
x= 0.6 


15. Continue the solution of Example 3 of Section 9.4, 
= ut, + 1 =x? = 0, u(x, 0) = x(1 — x), u(x, 0) = 0, 
u(O, 1) = 0, u(1, t) = 0, 


by finding the solution at the intersection of characteristics through points at (0.4, 0) and
(0.6, 0). Then find the solution at the intersection of characteristics through (0.2, 0) and (0.6,
0) as an example of how the solution can be advanced in time.


Section 9.5 


16. Solve the vibrating membrane example of Section 9.5 but with initial conditions of


ou 
= (x, y) = x(2 — x)y(2 — y). 


u(x, y) = 10; 
(x, Nheo nD) 


▶17. A membrane is stretched over a frame that occupies the region in the xy-plane bounded by
x = 0, x = 2, y = 0, y = 4.
Initially the point at (1, 3) is lifted 1 in above the xy-plane, and then released. If T =
7 lb/in and w = 0.07 lb/in², find the displacement at the point (1, 2) as a function of time.


18. How do the vibrations in Exercise 17 change if w = 0.7, the other parameters being unchanged?


19. The frame holding the membrane in Exercise 17 is distorted by lifting the corner at (2, 4)
so it is 1 in above the xy-plane. (The frame members stretch at the same time, so the corner
point moves only vertically.) The membrane is set to vibrating in the same way as in Exercise
17. Follow its vibrations through time. (Assume that the initial positions of points on the
membrane lie on intersecting planes.)


20. A vibrating string that has a damping force opposing its motion that is proportional to the
velocity follows the equation
ay _Tedy ay 


aw ax? at” 


where B is the magnitude of the damping force. Solve the problem if the length of the string
is 5 ft with T = 24 lb, w = 0.1 lb/ft, and B = 2.0. Initial conditions are


Wleno = 5 0<x<3, 
5 

Wheo = 5-5 3<x<S, 

a 

| em 5). 


Compute a few points of the solution by both difference equations and the method of
characteristics. Compare the effort involved in the two methods.

21. A horizontal elastic rod is initially undeformed and is at rest. One end, at x = 0, is fixed,
and the other end, at x = L (when t = 0), is pulled with a steady force of F lb/ft². It can
be shown that the displacements y(x, t) of points originally at the point x are given by


ay ay F 
: y(0, 1) = 0, = ay, 
or % ar|,-, E 
2 
y(x, 0) = 0, Fa We =0 


| 

where a? = Eg/p; E = Young's modulus (1b/ft?); g = acceleration of gravity; p = density 
(Ib/ft*). Find y versus r for the midpoint of a 2-ft-long piece of rubber for which E = 
1.8 x 10° and p = 70 if F/E = 0.7. | 
22. A circular membrane, when set to vibrating, obeys the equation (in polar coordinates)

1 2( =) lau w dy 

=e | + 

rar\ ar) rae Teg ar? 
A 3-ft-diameter kettledrum is started to vibrating by depressing the center } in. If w = 0.072 _ 
lb/ft? and T = 80 Ib/ft, find how the displacements at 6 in and 12 in from the center vary 
with time. The problem can be solved in polar coordinates, or it can be solved in rectangular 
coordinates using the method of Section 7.8 to approximate ∇²u near the boundaries.
23. A flexible chain hangs freely, as shown in Fig. 9.8. For small disturbances from its equilibrium
position (hanging vertically), the equation of motion is


Figure 9.8


In this equation, x is the distance from the end of the chain, y is the displacement from the
equilibrium position, t is the time, and g is the acceleration of gravity. A 10-ft-long chain is
originally hanging freely. It is set into motion by striking it sharply at its midpoint, imparting
a velocity there of 1 ft/sec. Find how the chain moves as a result of the blow. If you find
you need additional information at t = 0, make reasonable assumptions.

24. Write a computer program that solves hyperbolic partial-differential equations by the method
of characteristics, given the values of u and ∂u/∂t along a curve that is not one of the
characteristics. Your program should be a subroutine that accepts the coordinates of two points
on the initial-condition curve and computes the value of u and ∂u/∂t at the intersection of
the characteristic curves through the two points.


Curve-Fitting and 
Approximation of Functions 


10.0 CONTENTS OF THIS CHAPTER


In an early chapter, when we studied how to fit a polynomial to a set of data, 
we made the implicit assumption that the data were free of error (except round- 
off) so that it was appropriate to match the interpolating polynomial exactly at 
the data points. In the case of experimental data, this assumption as to accuracy 
is often not true. Each data point is subject to experimental errors that, in the 
case of complicated measurements, may be relatively large in magnitude. Again, 
in the previous chapter, the true function that relates the data was generally 
unknown, while, in the case of experimental results, the form of the function is 
frequently known from the physical laws that apply. 

We wish to consider the problem of finding the “best” curve that represents 
data that are subject to error. “Best” is in quotation marks because the criterion 
of goodness of fit is to some degree arbitrary, although the least-squares criterion 
is commonly applied. 

We will also consider in this chapter the most efficient way in which functions 
can be represented by a polynomial or a ratio of polynomials over a given range 
of values of the argument. This is of special importance in connection with the 
library function subroutines for digital computers. 

The final topic of this chapter is a discussion of the approximation of functions
by trigonometric series. This leads us right away to a study of the Fourier
transform and, in particular, the consideration of FFTs (fast Fourier transforms).

10.1 LEAST-SQUARES APPROXIMATIONS

Describes how to fit the "best" straight line to a set of data points, where best is
defined in terms of a statistical criterion. A set of linear equations must be solved
to get the parameters of the line.

10.2 FITTING NONLINEAR CURVES BY LEAST SQUARES

Extends the least-squares procedure to handle curves; if the function chosen to fit
the points is a polynomial (the usual case), again a linear system must be solved,
but it is possible to use a more complex function. You learn how to determine the
appropriate degree when the function is a polynomial.

10.3 CHEBYSHEV POLYNOMIALS

Develops the theory of a class of orthogonal polynomials that are the basis for
fitting more complex functions by polynomials of maximum efficiency for
computers.

10.4 APPROXIMATION OF FUNCTIONS WITH ECONOMIZED POWER SERIES

Shows how the Chebyshev polynomials can be used to create polynomial
approximations that are significantly more efficient than a standard power series.

10.5 APPROXIMATION WITH RATIONAL FUNCTIONS

Applies the previous techniques to generate the coefficients of rational functions
(ratios of polynomials) that are still more efficient.

10.6 APPROXIMATION OF FUNCTIONS WITH TRIGONOMETRIC SERIES: FAST FOURIER TRANSFORMS

A sum of sine and cosine terms may fit a function better than other types, especially
when the function is periodic. When fit to equispaced data, the fast Fourier
transform (FFT) gets the coefficients with much less computational effort.

10.7 CHAPTER SUMMARY

Is the usual summation.

10.8 COMPUTER PROGRAMS

Presents a number of FORTRAN programs that implement algorithms of Chapter 10.


10.1 LEAST-SQUARES APPROXIMATIONS


Suppose we wish to fit a curve to an approximate set of data, such as from the determination
of the effects of temperature on a resistance by students in their physics laboratory. They
have recorded the temperature and resistance measurements as shown in Fig. 10.1, where
the graph suggests a linear relationship. We want to suitably determine the constants a
and b in the equation relating resistance R and temperature T,

$$R = aT + b. \tag{10.1}$$


Figure 10.1

Figure 10.2

Figure 10.3

We might accept the criterion that we make the magnitude of the maximum error a
minimum (the so-called minimax criterion), but for the problem at hand this is rarely
done.* This criterion is awkward because the absolute-value function has no derivative
at the origin, and it also is felt to give undue importance to a single large error. The
usual criterion is to minimize the sum of the squares of the errors, the "least-squares"
principle.†

In addition to giving a unique result for a given set of data, the least-squares method 
is also in accord with the maximum-likelihood principle of statistics. If the measurement 
errors have a so-called normal distribution and if the standard deviation is constant for 
all the data, the line determined by minimizing the sum of squares can be shown to have 
values of slope and intercept that have maximum likelihood of occurrence. 

*We will use this criterion later in this chapter, however.
†The various criteria for a "best fit" can be described by minimizing a norm of the error vector. Relate each
criterion to its corresponding vector norm to review the definition of such norms.

Let $Y_i$ represent an experimental value, and let $y_i$ be a value from the equation

$$y_i = a x_i + b,$$

where $x_i$ is a particular value of the variable assumed free of error. We wish to determine
the best values for a and b so that the y's predict the function values that correspond to
x-values. Let $e_i = Y_i - y_i$. The least-squares criterion requires that

$$S = e_1^2 + e_2^2 + \cdots + e_N^2 = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (Y_i - a x_i - b)^2$$

be a minimum. N is the number of (x, Y) pairs. We reach the minimum by proper choice
of the parameters a and b, so they are the "variables" of the problem. At a minimum for
S, the two partial derivatives $\partial S/\partial a$ and $\partial S/\partial b$ will both be zero. Hence, remembering
that the $x_i$ and $Y_i$ are data points unaffected by our choice of values for a and b, we have

$$\frac{\partial S}{\partial a} = 0 = \sum_{i=1}^{N} 2(Y_i - a x_i - b)(-x_i),$$

$$\frac{\partial S}{\partial b} = 0 = \sum_{i=1}^{N} 2(Y_i - a x_i - b)(-1).$$

Dividing each of these equations by -2 and expanding the summation, we get the so-called
normal equations

$$a \sum x_i^2 + b \sum x_i = \sum x_i Y_i,$$
$$a \sum x_i + b N = \sum Y_i. \tag{10.2}$$

All the summations in Eq. (10.2) are from i = 1 to i = N. Solving these equations
simultaneously gives the values for slope and intercept a and b.
For the data in Fig. 10.1 we find that

$$N = 5, \quad \sum T_i = 273.1, \quad \sum T_i^2 = 18{,}607.27, \quad \sum R_i = 4438, \quad \sum T_i R_i = 254{,}932.5.$$

Our normal equations are then

$$18{,}607.27a + 273.1b = 254{,}932.5,$$
$$273.1a + 5b = 4438.$$

From these we find a = 3.395, b = 702.2, and hence write Eq. (10.1) as

$$R = 702 + 3.39T.$$
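The arithmetic of this example can be checked with a few lines of Python. The sketch below assumes only the sums quoted above for the data of Fig. 10.1 and solves the two normal equations of Eq. (10.2) by Cramer's rule; the function name `line_from_sums` is illustrative, not from the book.

```python
def line_from_sums(n, sx, sxx, sy, sxy):
    """Solve a*sxx + b*sx = sxy and a*sx + b*n = sy for slope a and intercept b."""
    det = sxx * n - sx * sx
    a = (sxy * n - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b

# Sums quoted in the text for the temperature-resistance data.
a, b = line_from_sums(n=5, sx=273.1, sxx=18607.27, sy=4438.0, sxy=254932.5)
print(f"slope a = {a:.3f}, intercept b = {b:.1f}")
```

For raw data, the five sums would first be accumulated from the (T, R) pairs; the solution step is unchanged.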


10.2 FITTING NONLINEAR CURVES BY LEAST SQUARES


In many cases, of course, data from experimental tests are not linear, so we need to fit
some other function than a first-degree polynomial to them. Popular forms that are tried
are the exponential forms

$$y = a x^b$$

or

$$y = a e^{bx}.$$

We can develop normal equations for these analogously to the preceding development
for a least-squares line by setting the partial derivatives equal to zero. Such nonlinear
simultaneous equations are much more difficult to solve* than linear equations. Because
of this, the exponential forms are usually linearized by taking logarithms before
determining the parameters:

$$\ln y = \ln a + b \ln x$$

or

$$\ln y = \ln a + bx.$$


We now fit the new variable z = ln y as a linear function of ln x or x as described
earlier. Here we do not minimize the sum of squares of the deviations of Y from the
curve, but rather the deviations of ln Y. In effect, this amounts to minimizing the squares
of the percentage errors, which itself may be a desirable feature. An added advantage of
the linearized forms is that plots of the data on either log-log or semilog graph paper
show at a glance whether these forms are suitable by whether a straight line represents
the data when so plotted.
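The linearization can be sketched in Python as follows. The data here are made up for illustration (generated exactly from y = 2e^{0.5x}), and `fit_exponential` is a hypothetical helper name; it fits a straight line to z = ln y against x using the normal equations of Section 10.1 and then recovers a from ln a.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a*exp(b*x) by a least-squares line through (x, ln y)."""
    zs = [math.log(y) for y in ys]
    n = len(xs)
    sx, sz = sum(xs), sum(zs)
    sxx = sum(x * x for x in xs)
    sxz = sum(x * z for x, z in zip(xs, zs))
    det = n * sxx - sx * sx
    b = (n * sxz - sx * sz) / det          # slope of the line
    ln_a = (sxx * sz - sx * sxz) / det     # intercept of the line
    return math.exp(ln_a), b

# Illustrative data only: exact values of 2*exp(0.5*x).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(a, b)
```

With noisy data the recovered a and b minimize the squared deviations of ln Y, not of Y itself, which is the distinction drawn in the paragraph above.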

In cases when such linearization of the function is not desirable, or when no method
of linearization can be discovered, graphical methods are frequently used; one merely
plots the experimental values and sketches in a curve that seems to fit well. Special forms
of graph paper, in addition to log-log and semilog, may be useful (probability,
log-probability, and so on). Transformation of the variables to give near linearity, such as by
plotting against 1/x, 1/(ax + b), 1/x², and other nonpolynomial forms of the argument
may give curves with gentle enough changes in slope to allow a smooth curve to be
drawn. S-shaped curves are not easy to linearize; the Gompertz relation

$$y = a b^{c^x}$$

is sometimes employed. The constants a, b, and c are determined by special procedures.
Another relation that fits data to an S-shaped curve is

$$\frac{1}{y} = a + b e^{-x}.$$

*They are treated briefly in Chapter 2.

In awkward cases, subdividing the region of interest into subregions with a piecewise fit
in the subregions can be used.

The objection to the last-mentioned methods is the lack of uniqueness. Two
individuals will usually not draw the same curve through the points. One's judgment is
frequently distorted by one or two points that deviate widely from the remaining data.
Often one tends to pay too much attention to the extremities in comparison to the points
in the central parts of the region of interest.

Further problems are caused if we wish to integrate or differentiate the function. Our 
discussion of least-squares polynomials is one solution to these difficulties. 

Because polynomials can be readily manipulated, fitting such functions to data that 
do not plot linearly is common. We now consider this case. It will turn out that the normal 
equations are linear for this situation, which is an added advantage. In the development, 
we use n as the degree of the polynomial and N as the number of data pairs. Obviously 
if N = n + 1, the polynomial passes exactly through each point and the methods of 
Chapter 3 apply, so we will always have N > n + 1 in the following. 

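We assume the functional relationship

$$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n, \tag{10.3}$$

with errors defined by

$$e_i = Y_i - y_i = Y_i - a_0 - a_1 x_i - a_2 x_i^2 - \cdots - a_n x_i^n.$$

We again use $Y_i$ to represent the observed or experimental value corresponding to $x_i$, with
$x_i$ free of error. We minimize the sum of squares,

$$S = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (Y_i - a_0 - a_1 x_i - a_2 x_i^2 - \cdots - a_n x_i^n)^2.$$

At the minimum, all the partial derivatives $\partial S/\partial a_0, \partial S/\partial a_1, \ldots, \partial S/\partial a_n$ vanish. Writing
the equations for these gives n + 1 equations:

$$\frac{\partial S}{\partial a_0} = 0 = \sum_{i=1}^{N} 2(Y_i - a_0 - a_1 x_i - \cdots - a_n x_i^n)(-1),$$

$$\frac{\partial S}{\partial a_1} = 0 = \sum_{i=1}^{N} 2(Y_i - a_0 - a_1 x_i - \cdots - a_n x_i^n)(-x_i),$$

$$\vdots$$

$$\frac{\partial S}{\partial a_n} = 0 = \sum_{i=1}^{N} 2(Y_i - a_0 - a_1 x_i - \cdots - a_n x_i^n)(-x_i^n).$$

Dividing each by -2 and rearranging gives the n + 1 normal equations to be solved
simultaneously:

$$a_0 N + a_1 \sum x_i + a_2 \sum x_i^2 + \cdots + a_n \sum x_i^n = \sum Y_i,$$

$$a_0 \sum x_i + a_1 \sum x_i^2 + a_2 \sum x_i^3 + \cdots + a_n \sum x_i^{n+1} = \sum x_i Y_i,$$

$$a_0 \sum x_i^2 + a_1 \sum x_i^3 + a_2 \sum x_i^4 + \cdots + a_n \sum x_i^{n+2} = \sum x_i^2 Y_i,$$

$$\vdots$$

$$a_0 \sum x_i^n + a_1 \sum x_i^{n+1} + a_2 \sum x_i^{n+2} + \cdots + a_n \sum x_i^{2n} = \sum x_i^n Y_i. \tag{10.4}$$

Putting these equations in matrix form shows an interesting pattern in the coefficient
matrix.

$$\begin{bmatrix}
N & \sum x_i & \sum x_i^2 & \cdots & \sum x_i^n \\
\sum x_i & \sum x_i^2 & \sum x_i^3 & \cdots & \sum x_i^{n+1} \\
\sum x_i^2 & \sum x_i^3 & \sum x_i^4 & \cdots & \sum x_i^{n+2} \\
\vdots & \vdots & \vdots & & \vdots \\
\sum x_i^n & \sum x_i^{n+1} & \sum x_i^{n+2} & \cdots & \sum x_i^{2n}
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}
=
\begin{bmatrix} \sum Y_i \\ \sum x_i Y_i \\ \sum x_i^2 Y_i \\ \vdots \\ \sum x_i^n Y_i \end{bmatrix} \tag{10.5}$$

All the summations in Eqs. (10.4) and (10.5) run from 1 to N.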

Solving large sets of linear equations is not a simple task. Methods for this are the
subject of Chapter 2. These particular equations have an added difficulty in that they have
the undesirable property known as ill-conditioning. The result of this is that round-off
errors in solving them cause unusually large errors in the solutions, which of course are
the desired values of the coefficients $a_i$ in Eq. (10.3). Up to n = 4 or 5, the problem is
not too great (that is, double-precision arithmetic in computer solutions is only desirable
and not essential), but beyond this point special methods are needed. Such special methods
use orthogonal polynomials in an equivalent form of Eq. (10.3). We will not pursue this
matter further,* although we will treat one form of orthogonal polynomials later in this
chapter in connection with representation of functions. From the point of view of the
experimentalist, functions more complex than fourth-degree polynomials are rarely
needed, and when they are, the problem can often be handled by fitting a series of
polynomials to subsets of the data.

The matrix of Eq. (10.5) is called the normal matrix for the least-squares problem.
There is another matrix that corresponds to this, called the design matrix. It is of the
form

$$A = \begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
x_1 & x_2 & x_3 & \cdots & x_N \\
x_1^2 & x_2^2 & x_3^2 & \cdots & x_N^2 \\
\vdots & \vdots & \vdots & & \vdots \\
x_1^n & x_2^n & x_3^n & \cdots & x_N^n
\end{bmatrix}.$$

*Ralston (1965) is a good source of further information. The ill-conditioning problem, though very real, is
often academic, since it is seldom that a degree above 4 or 5 is needed to give a curve that fits the data with
adequate precision.

It is easy to show that $AA^T$ is just the coefficient matrix of Eq. (10.5). It is also easy to
see that $Ay$, where y is the column vector of Y-values, gives the right-hand side of
Eq. (10.5). (You ought to try this for, say, a 3 × 3 case to reassure yourself.) This means
that we can rewrite Eq. (10.5), in matrix form, as

$$AA^T a = Ba = Ay.$$
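The claim that $AA^T$ and $Ay$ reproduce the two sides of Eq. (10.5) can be checked numerically. The Python sketch below is illustrative only: it builds the design matrix for a quadratic (n = 2) from the x-values of Table 10.1 and forms the products with plain loops, so that the entries can be compared with the sums listed in that table.

```python
# Data of Table 10.1.
xs = [0.05, 0.11, 0.15, 0.31, 0.46, 0.52, 0.70, 0.74, 0.82, 0.98, 1.17]
ys = [0.956, 0.890, 0.832, 0.717, 0.571, 0.539, 0.378, 0.370, 0.306, 0.242, 0.104]

degree = 2
# Design matrix A: row k holds x_j**k, so A is (n + 1) x N.
A = [[x ** k for x in xs] for k in range(degree + 1)]

# B = A A^T is (n + 1) x (n + 1); entry (r, c) is the sum of x_j**(r + c).
B = [[sum(A[r][j] * A[c][j] for j in range(len(xs))) for c in range(degree + 1)]
     for r in range(degree + 1)]

# Right-hand side A y; entry r is the sum of x_j**r * Y_j.
rhs = [sum(A[r][j] * ys[j] for j in range(len(xs))) for r in range(degree + 1)]

for row, value in zip(B, rhs):
    print(" ".join(f"{entry:8.4f}" for entry in row), "|", f"{value:8.4f}")
```

The printed rows can be compared with the normal equations used for the quadratic example later in this section.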


Usually we would use Gaussian elimination to solve the system, but because B has special
properties, we can use other methods that avoid the problem of ill-conditioning that we
pointed out above.


1. The matrix $B = AA^T$ is symmetric and positive semidefinite. An n × n matrix M
is said to be positive semidefinite if, for every n-component vector x, $x^T M x \ge 0$.
If we add the condition that $x^T M x = 0$ only if x is the zero vector, M is said to
be positive definite. (You should show that B is positive semidefinite and
symmetric.)


2. In linear algebra, it is shown that B can be diagonalized by an orthogonal matrix
P:

$$PBP^T = PAA^TP^T = D,$$

where the diagonal elements of D are the eigenvalues of B. Note that orthogonality
implies that $PP^T = I$, the identity matrix.

3. Since B is positive semidefinite, all of its eigenvalues are nonnegative. This
means that we can define a matrix S as

$$S = \sqrt{D}, \quad \text{or} \quad SS^T = D.$$

The diagonal elements of S are called the singular values of A.

4. We can rewrite Eq. (10.5) and its solution as follows:

$$AA^T a = P^T D P a = P^T S (P^T S)^T a = Ay,$$
$$a = P^T D^{-1} P A y.$$

This last eliminates having to multiply out $AA^T$ and, by extending this approach,
leads to an important method for solving Eq. (10.5) called singular-value
decomposition.


Table 10.1 Data to illustrate curve-fitting

x:  0.05   0.11   0.15   0.31   0.46   0.52   0.70   0.74   0.82   0.98   1.17
Y:  0.956  0.890  0.832  0.717  0.571  0.539  0.378  0.370  0.306  0.242  0.104

N = 11
$\sum x_i = 6.01$         $\sum Y_i = 5.905$
$\sum x_i^2 = 4.6545$     $\sum x_i Y_i = 2.1839$
$\sum x_i^3 = 4.1150$     $\sum x_i^2 Y_i = 1.3357$
$\sum x_i^4 = 3.9161$


Figure 10.4


We illustrate the use of Eqs. (10.4) to fit a quadratic to the data of Table 10.1. Figure
10.4 shows a plot of the data. (The data are actually a perturbation of the relation y =
1 - x + 0.2x². It will be of interest to see how well we approximate this function.) To
set up the normal equations, we need the sums tabulated in Table 10.1. A calculator can
give these as accumulated totals directly. A computer program at the end of this chapter
is designed to set up the matrix and the right-hand-side vector. We need to solve the set
of equations

$$11a_0 + 6.01a_1 + 4.6545a_2 = 5.905,$$
$$6.01a_0 + 4.6545a_1 + 4.1150a_2 = 2.1839,$$
$$4.6545a_0 + 4.1150a_1 + 3.9161a_2 = 1.3357.$$

The result is $a_0 = 0.998$, $a_1 = -1.018$, $a_2 = 0.225$, so the least-squares method gives

$$y = 0.998 - 1.018x + 0.225x^2.$$

Compare this to y = 1 - x + 0.2x². We do not expect to reproduce the coefficients
exactly because of the errors in the data.
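The 3 × 3 system above is small enough to solve by hand, but a short Python sketch makes the step reproducible. It assumes only the sums of Table 10.1 and uses ordinary Gaussian elimination with partial pivoting (the helper `gauss_solve` is an illustrative name, standing in for the methods of Chapter 2).

```python
def gauss_solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol

# Normal equations for the quadratic, built from the sums in Table 10.1.
normal = [[11.0, 6.01, 4.6545],
          [6.01, 4.6545, 4.1150],
          [4.6545, 4.1150, 3.9161]]
right = [5.905, 2.1839, 1.3357]

a0, a1, a2 = gauss_solve(normal, right)
print(f"y = {a0:.3f} + ({a1:.3f})x + ({a2:.3f})x^2")
```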



In the general case, we may wonder what degree of polynomial should be used. As
we use higher-degree polynomials, we of course will reduce the deviations of the points
from the curve until, when the degree of the polynomial, n, equals N - 1, there is an
exact match (assuming no duplicate data at the same x-value) and we have the interpolating
polynomials of Chapter 3. The answer to this problem is found in statistics. One increases
the degree of approximating polynomial so long as there is a statistically significant
decrease in the variance, σ², which is computed by

$$\sigma^2 = \frac{\sum e_i^2}{N - n - 1}. \tag{10.6}$$

For the above example, when the degree of the polynomial made to fit the points is
varied from 1 to 7, we obtain the results shown in Table 10.2.

The criterion of Eq. (10.6) chooses the optimum degree as 2. This is no surprise,
in view of how the data were constructed. It is important to realize that the numerator
of Eq. (10.6), the sum of the deviations squared of the points from the curve, should
continually decrease as the degree of the polynomial is raised. It is the denominator of
Eq. (10.6) that makes σ² increase as we go above the optimum degree. In this example,
this behavior is observed for n = 3. Above n = 3, a second effect sets in. Due to
ill-conditioning, the coefficients of the least-squares polynomials are determined with poor
precision. This modifies the expected increases of the values of σ².
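The variance criterion can be tried on the data of Table 10.1 with the Python sketch below. It is an illustration under stated assumptions rather than the book's program: the helper names are hypothetical, the normal equations of Eq. (10.4) are solved by plain Gaussian elimination, and only low degrees are attempted because of the ill-conditioning just described.

```python
def gauss_solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol

def fit_polynomial(xs, ys, n):
    """Coefficients a_0..a_n from the normal equations, Eq. (10.4)."""
    normal = [[sum(x ** (r + c) for x in xs) for c in range(n + 1)] for r in range(n + 1)]
    right = [sum((x ** r) * y for x, y in zip(xs, ys)) for r in range(n + 1)]
    return gauss_solve(normal, right)

def variance(xs, ys, n):
    """Eq. (10.6): sum of squared deviations divided by N - n - 1."""
    coef = fit_polynomial(xs, ys, n)
    sse = sum((y - sum(c * x ** k for k, c in enumerate(coef))) ** 2
              for x, y in zip(xs, ys))
    return sse / (len(xs) - n - 1)

xs = [0.05, 0.11, 0.15, 0.31, 0.46, 0.52, 0.70, 0.74, 0.82, 0.98, 1.17]
ys = [0.956, 0.890, 0.832, 0.717, 0.571, 0.539, 0.378, 0.370, 0.306, 0.242, 0.104]

var1 = variance(xs, ys, 1)
var2 = variance(xs, ys, 2)
var3 = variance(xs, ys, 3)
print(var1, var2, var3)
```

One would raise the degree only while the printed variance keeps dropping by a meaningful amount.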

Before leaving this section, we illustrate how to apply these methods to more com- 
plicated functions. 


Table 10.2

Degree   Equation                                                       σ² (Eq. (10.6))   Σe²
  1      y = 0.952 - 0.760x                                                 0.0010        0.0092
  2      y = 0.998 - 1.018x + 0.225x^2                                      0.0002        0.0018
  3      y = 1.004 - 1.079x + 0.351x^2 - 0.069x^3                           0.0003        0.0018
  4      y = 0.998 - 0.838x - 0.522x^2 + 1.040x^3 - 0.454x^4                0.0003        0.0016
  5      y = 1.031 - 1.704x + 4.278x^2 - 9.477x^3 + 9.394x^4 - 3.290x^5     0.0001        0.0007
  6      y = 1.038 - 1.910x + 5.952x^2 - 15.078x^3 + 18.277x^4              0.0002        0.0007
             - 9.835x^5 + 1.836x^6
  7      y = 1.032 - 1.742x + 4.694x^2 - 11.898x^3 + 16.645x^4              0.0002        0.0007
             - 14.346x^5 + 8.141x^6 - 2.293x^7


EXAMPLE

The results of a wind tunnel experiment on the flow of air on the wing tip of an airplane
provide the following data:

R/C: 0.73, 0.78, 0.81, 0.86, 0.875, 0.89, 0.95, 1.02, 1.03, 1.055, 1.135, 1.14,
1.245, 1.32, 1.385, 1.43, 1.445, 1.535, 1.57, 1.63, 1.755;

V_θ/V_∞: 0.0788, 0.0788, 0.064, 0.0788, 0.0681, 0.0703, 0.0703, 0.0681, 0.0681,
0.079, 0.0575, 0.0681, 0.0575, 0.0511, 0.0575, 0.049, 0.0532, 0.0511,
0.049, 0.0532, 0.0426;


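where R is the distance from the vortex core, C is the aircraft wing chord, $V_\theta$ is the vortex
tangential velocity, and $V_\infty$ is the aircraft free-stream velocity. Let x = R/C and
$y = V_\theta/V_\infty$. We would like our curve to be of the form

$$g(x) = \frac{A}{x}\left(1 - e^{-\lambda x^2}\right),$$

and our least-squares equations become

$$S = \sum_{i=1}^{21} (Y_i - g(x_i))^2 = \sum_{i=1}^{21} \left(Y_i - \frac{A}{x_i}\left(1 - e^{-\lambda x_i^2}\right)\right)^2.$$

Setting $S_A = S_\lambda = 0$ gives the following equations:

$$\sum_{i=1}^{21} \frac{1}{x_i}\left(1 - e^{-\lambda x_i^2}\right)\left(Y_i - \frac{A}{x_i}\left(1 - e^{-\lambda x_i^2}\right)\right) = 0,$$

$$\sum_{i=1}^{21} x_i e^{-\lambda x_i^2}\left(Y_i - \frac{A}{x_i}\left(1 - e^{-\lambda x_i^2}\right)\right) = 0.$$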

Program 3 (Fig. 10.10) solves this system of nonlinear equations using NLSYST, to give
us


g(x) = Oo7e8 -e? 30574x7) 


For these values of A and λ, S = 0.000016. The graph of this function is presented in
Fig. 10.5.


Figure 10.5


10.3 CHEBYSHEV POLYNOMIALS 


We turn now to the problem of representing a function with a minimum error. This is a 
central problem in the software development of digital computers because it is more 
economical to compute the values of the common functions using an efficient approxi- 
mation than to store a table of values and employ interpolation techniques. Since digital 
computers are essentially only arithmetic devices, the most elaborate function they can 
compute is a rational function, a ratio of polynomials. We will hence restrict our discussion 
to representation of functions by polynomials or rational functions. 

One way to approximate a function by a polynomial is to use a truncated Taylor 
series. This is not the best way, in most cases. In order to study better ways, we need 
to introduce the Chebyshev polynomials. 

The familiar Taylor-series expansion represents the function with very small error 
near the point of the expansion, but the error increases rapidly (proportional to a power) 
as we employ it at points farther away. In a digital computer, we have no control over 
where in an interval the approximation will be used, so the Taylor series is not usually 
appropriate. We would prefer to trade some of its excessive precision at the center of the 
interval to reduce the errors at the ends. 

We can do this while still expressing functions as polynomials by the use of Chebyshev 
polynomials. The first few of these are* 


$$T_0(x) = 1,$$
$$T_1(x) = x,$$
$$T_2(x) = 2x^2 - 1,$$
$$T_3(x) = 4x^3 - 3x,$$
$$T_4(x) = 8x^4 - 8x^2 + 1,$$
$$T_5(x) = 16x^5 - 20x^3 + 5x,$$
$$T_6(x) = 32x^6 - 48x^4 + 18x^2 - 1,$$
$$T_7(x) = 64x^7 - 112x^5 + 56x^3 - 7x,$$
$$T_8(x) = 128x^8 - 256x^6 + 160x^4 - 32x^2 + 1,$$
$$T_9(x) = 256x^9 - 576x^7 + 432x^5 - 120x^3 + 9x,$$
$$T_{10}(x) = 512x^{10} - 1280x^8 + 1120x^6 - 400x^4 + 50x^2 - 1. \tag{10.7}$$

*The commonly accepted symbol T(x) comes from the older spelling, Tschebycheff.


Figure 10.6


The members of this series of polynomials can be generated from the two-term recursion
formula

$$
T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x), \qquad T_0(x) = 1, \quad T_1(x) = x. \tag{10.8}
$$

Note that the coefficient of $x^n$ in $T_n(x)$ is always $2^{n-1}$. In Fig. 10.6 we plot the first four
polynomials of Eq. (10.7).

These polynomials have some unusual properties. They form an orthogonal set, in
that

$$
\int_{-1}^{1} \frac{T_n(x)\,T_m(x)}{\sqrt{1 - x^2}}\,dx =
\begin{cases}
0, & n \ne m,\\
\pi, & n = m = 0,\\
\pi/2, & n = m \ne 0.
\end{cases}
\tag{10.9}
$$


The orthogonality of these functions will not be of immediate concern to us. 
The Chebyshev polynomials are also terms of Fourier series, since

$$
T_n(x) = \cos n\theta, \tag{10.10}
$$

where $\theta = \arccos x$. Observe that $\cos 0 = 1$, $\cos\theta = \cos(\arccos x) = x$.
In order to demonstrate the equivalence of Eq. (10.10) to Eqs. (10.7) and (10.8),
we recall some trigonometric identities, such as

$$
\begin{aligned}
\cos 2\theta &= 2\cos^2\theta - 1, & T_2(x) &= 2x^2 - 1;\\
\cos 3\theta &= 4\cos^3\theta - 3\cos\theta, & T_3(x) &= 4x^3 - 3x;\\
\cos (n+1)\theta + \cos (n-1)\theta &= 2\cos\theta\,\cos n\theta, & T_{n+1}(x) + T_{n-1}(x) &= 2x\,T_n(x).
\end{aligned}
$$


Because of the relation $T_n(x) = \cos n\theta$, it is apparent that the Chebyshev polynomials
have a succession of maximums and minimums of alternating signs, each of magnitude
one. Further, since $|\cos n\theta| = 1$ for $n\theta = 0, \pi, 2\pi, \ldots$, and since $\theta$ varies from 0 to
$\pi$ as $x$ varies from 1 to $-1$, $T_n(x)$ assumes its maximum magnitude of unity $n + 1$ times
on the interval $[-1, 1]$.
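The recursion of Eq. (10.8) and the identity of Eq. (10.10) can be checked against each other numerically. The following is a minimal Python sketch, not part of the original text; the helper names `chebyshev_coefficients` and `evaluate` are illustrative choices, and polynomials are stored as coefficient lists with the constant term first.

```python
import math

def chebyshev_coefficients(n_max):
    """Coefficient lists (constant term first) for T_0 .. T_n_max,
    built from T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)."""
    polys = [[1], [0, 1]]
    for n in range(1, n_max):
        shifted = [0] + [2 * c for c in polys[n]]            # 2x * T_n(x)
        previous = polys[n - 1] + [0] * (len(shifted) - len(polys[n - 1]))
        polys.append([s - p for s, p in zip(shifted, previous)])
    return polys[: n_max + 1]

def evaluate(coeffs, x):
    """Nested (Horner) evaluation of a polynomial."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

T = chebyshev_coefficients(10)
print(T[6])  # coefficients of T_6, constant term first

# Largest difference between the polynomial form and cos(n arccos x)
# at a few sample points in [-1, 1].
max_diff = max(
    abs(evaluate(T[n], x) - math.cos(n * math.acos(x)))
    for n in range(11)
    for x in (-1.0, -0.5, 0.0, 0.3, 0.9, 1.0)
)
print(max_diff)
```

The printed coefficient list for `T[6]` should agree with the entry for $T_6(x)$ in Eq. (10.7), and `max_diff` should be at the level of floating-point roundoff.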

Most important for our present application of these polynomials is the fact that, of
all polynomials of degree $n$ where the coefficient of $x^n$ is unity, the polynomial

$$
\frac{1}{2^{n-1}}\,T_n(x)
$$

has a smaller upper bound to its magnitude in the interval $[-1, 1]$ than any other. Because
the maximum magnitude of $T_n(x)$ is one, the upper bound referred to is $1/2^{n-1}$. This is
of importance because we will be able to write power-series representations of functions
whose maximum errors are given in terms of this upper bound.


We first prove this assertion about bounds on the magnitude of polynomials. The
proof is by contradiction. Let $P_n(x)$ be a polynomial whose leading term* is $x^n$ and suppose
that its maximum magnitude on $[-1, 1]$ is less than that of $T_n(x)/2^{n-1}$. Write

$$
T_n(x)/2^{n-1} - P_n(x) = P_{n-1}(x),
$$

where $P_{n-1}(x)$ is a polynomial of degree $n - 1$ or less, since the $x^n$ terms cancel. The
polynomial $T_n(x)$ has $n + 1$ extremes (counting endpoints), each of magnitude one, so
$T_n(x)/2^{n-1}$ has $n + 1$ extremes each of magnitude $1/2^{n-1}$, and these successive extremes
alternate in sign. By our supposition about $P_n(x)$, at each of these maximums or minimums,
the magnitude of $P_n(x)$ is less than $1/2^{n-1}$; hence $P_{n-1}(x)$ has the sign of $T_n(x)$ at each
of the $n + 1$ extremes, and so takes alternating signs at $n + 1$ successive points. $P_{n-1}(x)$
hence crosses the axis at least $n$ times and would have $n$ zeros. But this is impossible if
$P_{n-1}(x)$ is only of degree $n - 1$, unless it is identically zero. The premise must then be
false and $P_n(x)$ has a larger magnitude than the polynomial we are testing or, alternatively,
$P_n(x)$ is exactly the same polynomial.


10.4 APPROXIMATION OF FUNCTIONS WITH ECONOMIZED
POWER SERIES


We are now ready to use Chebyshev polynomials to “economize” a power series. Consider
the Maclaurin series for $e^x$:

$$
e^x = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \frac{x^5}{120} + \frac{x^6}{720} + \cdots.
$$

If we would like to use a truncated series to approximate $e^x$ on the interval $[0, 1]$ with
precision of 0.001, we will have to retain terms through that in $x^6$, since the error after
the term in $x^5$ will be more than 1/720. Suppose we subtract

$$
\left(\frac{1}{720}\right)\left(\frac{T_6}{32}\right)
$$

from the truncated series. We note from Eq. (10.7) that this will exactly cancel the $x^6$
term and at the same time make adjustments in other coefficients of the Maclaurin series.
Since the maximum value of $T_6$ on the interval $[0, 1]$ is unity, this will change the sum
of the truncated series by only

$$
\frac{1}{720}\cdot\frac{1}{32} < 0.00005,
$$

which is small with respect to our required precision of 0.001. Performing the calculations,
we have

$$
\begin{aligned}
e^x &\approx 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \frac{x^5}{120} + \frac{x^6}{720}
 - \frac{1}{720}\left(\frac{1}{32}\right)(32x^6 - 48x^4 + 18x^2 - 1),\\
e^x &\approx 1.000043 + x + 0.499219x^2 + \frac{x^3}{6} + 0.043750x^4 + \frac{x^5}{120}.
\end{aligned}
\tag{10.11}
$$

*We restrict the polynomials to those whose leading term is $x^n$ so that all are scaled alike.


This gives a fifth-degree polynomial that approximates e* on [0, 1] almost as well as the 
sixth-degree one derived from the Maclaurin series. (The actual maximum error of the 
fifth-degree expression is 0.000270; for the sixth-degree expression it is 0.000226.) We 
hence have “economized” the power series in that we get nearly the same precision with 
fewer terms. 
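The economization step of Eq. (10.11) is easy to reproduce by machine. The sketch below is an illustration added here, not the authors' program: it assumes Python, stores polynomials as coefficient lists (constant term first), and estimates the maximum errors on $[0, 1]$ by sampling a grid of 1001 points, which is an assumption about how finely the interval needs to be searched.

```python
import math

# Maclaurin coefficients of e^x through x^6 (constant term first).
maclaurin = [1 / math.factorial(k) for k in range(7)]

# T_6(x) = 32x^6 - 48x^4 + 18x^2 - 1, from Eq. (10.7).
t6 = [-1, 0, 18, 0, -48, 0, 32]

# Subtract (1/720)(T_6/32); the x^6 coefficient cancels, leaving degree five.
scale = maclaurin[6] / t6[6]
economized = [m - scale * t for m, t in zip(maclaurin, t6)][:6]

def horner(coeffs, x):
    """Nested evaluation of a polynomial."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

grid = [i / 1000 for i in range(1001)]
err_maclaurin = max(abs(math.exp(x) - horner(maclaurin, x)) for x in grid)
err_economized = max(abs(math.exp(x) - horner(economized, x)) for x in grid)

print(economized)
print(err_maclaurin, err_economized)
```

The coefficients printed for `economized` should match Eq. (10.11) to the digits shown there, and the two error estimates should be close to the values 0.000226 and 0.000270 quoted in the text.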

By subtracting $\frac{1}{120}(T_5/16)$ we can economize further, getting a fourth-degree polynomial
that is almost as good as the economized fifth-degree one. It is left as an exercise
to do this and to show that the maximum error is now 0.000781, so that we have found
a fourth-degree power series that meets an error criterion that requires us to use two
additional terms of the original Maclaurin series. Because of the relative ease with which
they can be developed, such economized power series are frequently used for approximations
to functions and are much more efficient than power series of the same degree
obtained by merely truncating a Taylor or Maclaurin series. Table 10.3 compares the
errors of these power series.


Table 10.3 Comparison of errors of economized power series and a Maclaurin series for $e^x$

                     Maclaurin,     Economized,    Economized,     Maclaurin,
  x      e^x         sixth-degree   fifth-degree   fourth-degree   fourth-degree
  0      1.00000     1.00000        1.00004        1.00004         1.00000
  0.2    1.22140     1.22140        1.22142        1.22098         1.22140
  0.4    1.49182     1.49182        1.49179        1.49133         1.49173
  0.6    1.82212     1.82211        1.82208        1.82212         1.82140
  0.8    2.22554     2.22549        2.22553        2.22605         2.22240
  1.0    2.71828     2.71806        2.71801        2.71749         2.70833

  Maximum error      0.00023        0.00027        0.00078         0.00995
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The maximum error in the economized fifth-degree polynomial is only slightly greater 
than the sixth-degree Maclaurin series. The economized fourth-degree polynomial incurs 
a maximum error about three and one-half times as much, but still within the 0.001 limit 
that was initially imposed, and will require significantly reduced computational effort. In 
addition, there is a proportionately reduced memory-space requirement to store the con- 
stants of the polynomial. In contrast, a fourth-degree Maclaurin series has an error nearly 
ten times greater than the 0.001 tolerance, and its error is over twelve times that of the 
fourth-degree economized form. 

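By rearranging the Chebyshev polynomials, we can express powers of $x$ in terms of
them:

$$
\begin{aligned}
1 &= T_0,\\
x &= T_1,\\
x^2 &= \tfrac{1}{2}(T_0 + T_2),\\
x^3 &= \tfrac{1}{4}(3T_1 + T_3),\\
x^4 &= \tfrac{1}{8}(3T_0 + 4T_2 + T_4),\\
x^5 &= \tfrac{1}{16}(10T_1 + 5T_3 + T_5),\\
x^6 &= \tfrac{1}{32}(10T_0 + 15T_2 + 6T_4 + T_6),\\
x^7 &= \tfrac{1}{64}(35T_1 + 21T_3 + 7T_5 + T_7),\\
x^8 &= \tfrac{1}{128}(35T_0 + 56T_2 + 28T_4 + 8T_6 + T_8),\\
x^9 &= \tfrac{1}{256}(126T_1 + 84T_3 + 36T_5 + 9T_7 + T_9).
\end{aligned}
\tag{10.12}
$$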


By substituting these identities into an infinite Taylor series and collecting terms in $T_i(x)$,
we create a Chebyshev series. For example, we can get the first four terms of a Chebyshev
series by starting with the Maclaurin expansion for $e^x$. Such a series converges more
rapidly than does a Taylor series on $[-1, 1]$:

$$
e^x = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \frac{x^5}{120} + \frac{x^6}{720} + \cdots.
$$

Replacing terms by Eq. (10.12), but omitting polynomials beyond $T_3(x)$, since we want
only four terms,* we have


$$
\begin{aligned}
e^x &= T_0 + T_1 + \frac{1}{4}(T_0 + T_2) + \frac{1}{24}(3T_1 + T_3) + \frac{1}{192}(3T_0 + 4T_2 + \cdots)\\
&\quad + \frac{1}{1920}(10T_1 + 5T_3 + \cdots) + \frac{1}{23{,}040}(10T_0 + 15T_2 + \cdots) + \cdots\\
&= 1.2661T_0 + 1.1303T_1 + 0.2715T_2 + 0.0444T_3 + \cdots.
\end{aligned}
$$

*The number of terms that are employed determines the accuracy of the computed values, of course.


In order to compare the Chebyshev expansion with the Maclaurin series, we convert
back to powers of $x$, using Eq. (10.7):

$$
\begin{aligned}
e^x &= 1.2661 + 1.1303(x) + 0.2715(2x^2 - 1) + 0.0444(4x^3 - 3x) + \cdots\\
&= 0.9946 + 0.9971x + 0.5430x^2 + 0.1776x^3 + \cdots.
\end{aligned}
\tag{10.13}
$$


Table 10.4 and Fig. 10.7 compare the error of the Chebyshev expansion, Eq. (10.13),
with the Maclaurin series, using terms through $x^3$ in each case. The figure shows how
the Chebyshev expansion attains a smaller maximum error by permitting the error at the
origin to increase. The errors can be considered to be distributed more or less uniformly
throughout the interval. In contrast to this, the Maclaurin expansion, which gives very
small errors near the origin, allows the error to bunch up at the ends of the interval.
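The substitution of Eq. (10.12) into a Maclaurin series can be mechanized. The sketch below is an added illustration under stated assumptions: it uses Python, it generates the rows of Eq. (10.12) from the binomial-coefficient pattern they follow (the helper name `power_to_chebyshev` is hypothetical), and it sums Maclaurin terms of $e^x$ through $x^{12}$, which is more than enough for four-decimal coefficients.

```python
import math

def power_to_chebyshev(k):
    """Return d with x^k = sum_j d[j] * T_j(x), the pattern of Eq. (10.12)."""
    d = [0.0] * (k + 1)
    for j in range(k, -1, -2):                      # j has the same parity as k
        weight = math.comb(k, (k - j) // 2) / 2 ** (k - 1)
        d[j] = weight / 2 if j == 0 else weight     # the T_0 term is halved
    return d

terms = 12
cheb = [0.0] * (terms + 1)
for k in range(terms + 1):
    c_k = 1 / math.factorial(k)                     # Maclaurin coefficient of e^x
    for j, d in enumerate(power_to_chebyshev(k)):
        cheb[j] += c_k * d

c0, c1, c2, c3 = cheb[:4]
# Convert the four-term Chebyshev series back to powers of x, as in Eq. (10.13).
power_form = [c0 - c2, c1 - 3 * c3, 2 * c2, 4 * c3]

print([round(c, 4) for c in cheb[:4]])
print([round(p, 4) for p in power_form])
```

Because this sketch carries more digits than the four-place values used in the text, the power-form coefficients may differ from Eq. (10.13) by a unit or two in the fourth decimal place.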

If the function is to be expressed directly as an expansion in Chebyshev polynomials,
the coefficients can be obtained by integration. Based on the orthogonality property, the
coefficients are computed from

$$
a_i = \frac{2}{\pi}\int_{-1}^{1}\frac{f(x)\,T_i(x)}{\sqrt{1 - x^2}}\,dx,
$$

and the series is expressed as

$$
f(x) = \frac{a_0}{2} + \sum_{i=1}^{\infty} a_i T_i(x).
$$


Table 10.4 Comparison of Chebyshev series for $e^x$ with Maclaurin series:
Chebyshev: $e^x \approx 0.9946 + 0.9971x + 0.5430x^2 + 0.1776x^3$;
Maclaurin: $e^x \approx 1 + x + 0.5x^2 + 0.1667x^3$

   x       e^x       Chebyshev   Error      Maclaurin   Error
 -1.0     0.3679     0.3629       0.0050    0.3333      0.0346
 -0.8     0.4493     0.4535      -0.0042    0.4347      0.0146
 -0.6     0.5488     0.5535      -0.0047    0.5440      0.0048
 -0.4     0.6703     0.6713      -0.0010    0.6693      0.0010
 -0.2     0.8187     0.8155       0.0032    0.8187      0.0000
  0       1.0000     0.9946       0.0054    1.0000      0.0000
  0.2     1.2214     1.2172       0.0042    1.2213      0.0001
  0.4     1.4918     1.4917       0.0001    1.4907      0.0011
  0.6     1.8221     1.8267      -0.0046    1.8160      0.0061
  0.8     2.2255     2.2307      -0.0052    2.2053      0.0202
  1.0     2.7183     2.7123       0.0060    2.6667      0.0516


Figure 10.7 (plot of the error of the Maclaurin series and the error of the Chebyshev series)


A change of variable will be required if the desired interval is other than $(-1, 1)$. In
some cases, the definite integral which defines the coefficients can be profitably evaluated 
by numerical methods as described in Chapter 4. 

Since the coefficients of the terms of a Chebyshev expansion usually decrease even
more rapidly than the terms of a Maclaurin expansion, one can get an estimate of the
magnitude of the error from the next nonzero term after those that were retained. For the
truncated Chebyshev series given by Eq. (10.13), the $T_4(x)$ term would be

$$
\frac{1}{192}(T_4) + \frac{1}{23{,}040}(6T_4) + \cdots = 0.0052574\,T_4.
$$

Since the maximum value of $T_4(x)$ on $(-1, 1)$ is 1.0, we estimate the maximum error
of Eq. (10.13) to be 0.00525. The maximum error in Table 10.4 is 0.0060. This good
agreement is caused by the very rapid decrease in coefficients in this example.

The computational economy to be gained by economizing a Maclaurin series, or by 
using a Chebyshev series, is even more dramatic when the Maclaurin series is slowly 
convergent. The previous example for f(x) = e* is a case in which the Maclaurin series 
converges rapidly. The power of the methods of this section is better demonstrated in the 
following example. 


EXAMPLE A Maclaurin series for $(1 + x)^{-1}$ is

$$
(1 + x)^{-1} = 1 - x + x^2 - x^3 + x^4 - x^5 + \cdots \qquad (-1 < x < 1).
$$

Table 10.5 compares the accuracy of truncated Maclaurin series with the economized
series derived from them.

Table 10.5 Comparison of Maclaurin and economized series for $(1 + x)^{-1}$

            Maclaurin                      Economized                     Economized further
  Degree   Value      Error         Degree*  Value      Error         Degree*  Value      Error

  x = 0.2
    2      0.840000   0.006667
    4      0.833600   0.000267         2     0.758600  -0.074733
    6      0.833344   11 x 10^-6       4     0.764594  -0.068739
    8      0.833334   1 x 10^-6        6     0.803646  -0.029687
   10      0.833333   0                8     0.822786  -0.010547

  x = 0.8
    2      0.840000   0.284445
    4      0.737600   0.182045         2     0.812600   0.257045
    6      0.672064   0.116509         4     0.678314   0.122759
    8      0.630121   0.074566         6     0.628558   0.073003         4     0.658246   0.102691
   10      0.603277   0.047722         8     0.602106   0.046551         6     0.598199   0.042644

*Economized series were derived from Maclaurin series of corresponding degree.


In Table 10.5, we see that the error of the Maclaurin series is small for x = 0.2,
and this also would be true for other values near x = 0, while the economized polynomial
has less accuracy. At x = 0.8, the situation is reversed, however. Economized polynomials
of degrees 8 and 6, derived from truncated Maclaurin series of degrees 10 and 8, actually
have smaller errors than their precursors. Further economizations, giving polynomials of
degrees 6 and 4, have lesser or only slightly greater errors than their precursors, at
significant savings of computational effort and with smaller storage requirements in a
computer's memory for the coefficients of the polynomials.


10.5 APPROXIMATION WITH RATIONAL FUNCTIONS 


We have seen that expansion of a function in terms of Chebyshev polynomials gives a 
power-series expansion that is much more efficient on the interval (—1, 1) than the 
Maclaurin expansion, in that it has a smaller maximum error with a given number of 
terms. These are not the best approximations for use in most digital computers, however. 
In this application, we measure efficiency by the computer time required to evaluate the 
function, plus some consideration of storage requirements for the constants. Since the 
arithmetic operations of a computer can directly evaluate only polynomials, we limit our 
discussion of more efficient approximations to rational functions, which are the ratios of 
two polynomials. 
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Our discussion of methods of finding efficient rational approximations will be ele- 
mentary and introductory only. Obtaining truly best approximations is a difficult subject. 
In its present stage of development it is as much art as science, and requires successive 
approximations from a “suitably close” initial approximation. Our study will serve to 
introduce the student to some of the ideas and procedures used. The topic is of great 
importance, however, since the saving of just 1 msec of time in the generation of a 
frequently used elementary function may save hundreds of dollars’ worth of machine 
time each year. 

We start with a discussion of Padé approximations. Suppose we wish to represent a
function as the quotient of two polynomials:

$$
f(x) \approx R_N(x) = \frac{a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n}{1 + b_1 x + b_2 x^2 + \cdots + b_m x^m}, \qquad N = n + m.
$$

The constant term in the denominator can be taken as unity without loss of generality,
since we can always convert to this form by dividing numerator and denominator by $b_0$.
The constant $b_0$ will generally not be zero, for, in that case, the fraction would be undefined
at $x = 0$. The most useful of the Padé approximations are those with the degree of the
numerator equal to, or one greater than, the degree of the denominator. Note that the
number of constants in $R_N(x)$ is $N + 1 = n + m + 1$.

The Padé approximations are related to Maclaurin expansions in that the coefficients
are determined in a similar fashion to make $f(x)$ and $R_N(x)$ agree at $x = 0$ and also to
make the first $N$ derivatives agree at $x = 0$.*

We begin with the Maclaurin series for $f(x)$ (we use only terms through $x^N$) and
write

$$
\begin{aligned}
f(x) - R_N(x) &= (c_0 + c_1 x + c_2 x^2 + \cdots + c_N x^N) - \frac{a_0 + a_1 x + \cdots + a_n x^n}{1 + b_1 x + \cdots + b_m x^m}\\
&= \frac{(c_0 + c_1 x + \cdots + c_N x^N)(1 + b_1 x + \cdots + b_m x^m) - (a_0 + a_1 x + \cdots + a_n x^n)}{1 + b_1 x + \cdots + b_m x^m}.
\end{aligned}
\tag{10.14}
$$

The coefficients $c_i$ are $f^{(i)}(0)/(i!)$ of the Maclaurin expansion. Now if $f(x) = R_N(x)$ at
$x = 0$, the numerator of Eq. (10.14) must have no constant term. Hence

$$
c_0 - a_0 = 0. \tag{10.15}
$$

*A similar development can be derived for the expansion about a nonzero value of x, but the manipulations
are not as easy. By a change of variable we can always make the region of interest contain the origin.

In order for the first $N$ derivatives of $f(x)$ and $R_N(x)$ to be equal at $x = 0$, the coefficients
of the powers of $x$ up to and including $x^N$ in the numerator must all be zero also. This
gives $N$ additional equations for the $a$'s and $b$'s. The first $n$ of these involve $a$'s, the
rest only $b$'s and $c$'s:

$$
\begin{aligned}
b_1c_0 + c_1 - a_1 &= 0,\\
b_2c_0 + b_1c_1 + c_2 - a_2 &= 0,\\
b_3c_0 + b_2c_1 + b_1c_2 + c_3 - a_3 &= 0,\\
&\;\;\vdots\\
b_mc_{n-m} + b_{m-1}c_{n-m+1} + \cdots + c_n - a_n &= 0,\\
b_mc_{n-m+1} + b_{m-1}c_{n-m+2} + \cdots + c_{n+1} &= 0,\\
b_mc_{n-m+2} + b_{m-1}c_{n-m+3} + \cdots + c_{n+2} &= 0,\\
&\;\;\vdots\\
b_mc_{N-m} + b_{m-1}c_{N-m+1} + \cdots + c_N &= 0.
\end{aligned}
\tag{10.16}
$$

Note that, in each equation, the sum of the subscripts on the factors of each product is
the same, and is equal to the exponent of the x-term in the numerator. The $N + 1$ equations
of Eqs. (10.15) and (10.16) give the required coefficients of the Padé approximation. We
illustrate by an example.


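EXAMPLE Find $\arctan x \approx R_9(x)$. Use in the numerator a polynomial of degree five.
The Maclaurin series through $x^9$ is

$$
\arctan x = x - \tfrac{1}{3}x^3 + \tfrac{1}{5}x^5 - \tfrac{1}{7}x^7 + \tfrac{1}{9}x^9. \tag{10.17}
$$

We form, analogously to Eq. (10.14),

$$
f(x) - R_9(x) = \frac{\left(x - \tfrac{1}{3}x^3 + \tfrac{1}{5}x^5 - \tfrac{1}{7}x^7 + \tfrac{1}{9}x^9\right)(1 + b_1x + b_2x^2 + b_3x^3 + b_4x^4) - (a_0 + a_1x + a_2x^2 + \cdots + a_5x^5)}{1 + b_1x + b_2x^2 + b_3x^3 + b_4x^4}.
\tag{10.18}
$$

Making coefficients through that of $x^9$ in the numerator equal to zero, we get

$$
\begin{aligned}
a_0 &= 0,\\
a_1 &= 1,\\
a_2 &= b_1,\\
a_3 &= -\tfrac{1}{3} + b_2,\\
a_4 &= -\tfrac{1}{3}b_1 + b_3,\\
a_5 &= \tfrac{1}{5} - \tfrac{1}{3}b_2 + b_4,\\
\tfrac{1}{5}b_1 - \tfrac{1}{3}b_3 &= 0,\\
-\tfrac{1}{7} + \tfrac{1}{5}b_2 - \tfrac{1}{3}b_4 &= 0,\\
-\tfrac{1}{7}b_1 + \tfrac{1}{5}b_3 &= 0,\\
\tfrac{1}{9} - \tfrac{1}{7}b_2 + \tfrac{1}{5}b_4 &= 0.
\end{aligned}
$$

Solving, we get

$$
\begin{aligned}
a_0 &= 0, & a_1 &= 1, & a_2 &= 0, & a_3 &= \tfrac{7}{9}, & a_4 &= 0, & a_5 &= \tfrac{64}{945},\\
b_1 &= 0, & b_2 &= \tfrac{10}{9}, & b_3 &= 0, & b_4 &= \tfrac{5}{21}.
\end{aligned}
$$

A rational function that approximates arctan x is then

$$
\arctan x \approx \frac{x + \tfrac{7}{9}x^3 + \tfrac{64}{945}x^5}{1 + \tfrac{10}{9}x^2 + \tfrac{5}{21}x^4}. \tag{10.19}
$$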
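The linear equations of Eqs. (10.15) and (10.16) can be solved mechanically. The following Python sketch is an added illustration, not the authors' program: the function names `solve_linear` and `pade` are hypothetical, exact rational arithmetic from the standard `fractions` module is assumed so that the results can be compared with the fractions of the example, and the small Gauss-Jordan routine assumes the system is nonsingular.

```python
from fractions import Fraction

def solve_linear(matrix, rhs):
    """Gauss-Jordan elimination in exact arithmetic (assumes a unique solution)."""
    size = len(matrix)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[r][size] for r in range(size)]

def pade(c, n, m):
    """Numerator a_0..a_n and denominator b_0..b_m (b_0 = 1) from
    Maclaurin coefficients c_0..c_{n+m}, following Eqs. (10.15)-(10.16)."""
    def cc(i):
        return c[i] if i >= 0 else Fraction(0)
    # Equations without a's: sum_{j=1..m} b_j c_{k-j} = -c_k for k = n+1..n+m.
    matrix = [[cc(k - j) for j in range(1, m + 1)] for k in range(n + 1, n + m + 1)]
    rhs = [-c[k] for k in range(n + 1, n + m + 1)]
    b = [Fraction(1)] + solve_linear(matrix, rhs)
    # Equations with a's: a_k = sum_{j=0..min(k,m)} b_j c_{k-j}.
    a = [sum(b[j] * cc(k - j) for j in range(min(k, m) + 1)) for k in range(n + 1)]
    return a, b

def evaluate(coeffs, x):
    """Nested evaluation, constant term first."""
    result = 0
    for coefficient in reversed(coeffs):
        result = result * x + coefficient
    return result

# Maclaurin coefficients of arctan x through x^9, Eq. (10.17).
c = [Fraction(0)] * 10
for k in range(1, 10, 2):
    c[k] = Fraction((-1) ** ((k - 1) // 2), k)

a, b = pade(c, 5, 4)
print(a)
print(b)
value_at_one = float(evaluate(a, 1) / evaluate(b, 1))
print(value_at_one)
```

The printed lists should reproduce the coefficients of Eq. (10.19), and `value_at_one` should agree with the Padé entry for x = 1.0 in Table 10.6.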


In Table 10.6 we compare the errors for Padé approximation (Eq. 10.19) to the Maclaurin
series expansion (Eq. 10.17). Enough terms are available in the Maclaurin series to give
five-decimal precision at x = 0.2 and 0.4, but at x = 1 (the limit for convergence of the
series) the error is sizable. Even though we used no more information in establishing it,
the Padé formula is surprisingly accurate, having an error only 1/275 as large at x = 1.


Table 10.6 Comparison of Padé approximation to Maclaurin series for arctan x

         True       Padé                      Maclaurin
   x     value      Eq. (10.19)   Error       Eq. (10.17)   Error
  0.2    0.19740    0.19740        0.00000    0.19740        0.00000
  0.4    0.38051    0.38051        0.00000    0.38051        0.00000
  0.6    0.54042    0.54042        0.00000    0.54067       -0.00025
  0.8    0.67474    0.67477       -0.00003    0.67982       -0.00508
  1.0    0.78540    0.78558       -0.00018    0.83492       -0.04952


It is then particularly astonishing to realize that the Padé approximation is still not the
best one of its form, for it violates the minimax principle. If the extreme precision near
x = 0 is relaxed, we can make the maximum error smaller in the interval.

Before we discuss such better approximations in the form of rational functions,
remarks on the amount of effort required for the computation using Eq. (10.19) are in
order. If we implement the equation in a computer as it stands, we would, of course,
use the constants in decimal form, and we would evaluate the polynomials in nested
form:

$$
\begin{aligned}
\text{Numerator} &= [(0.0677x^2 + 0.7778)x^2 + 1]x,\\
\text{Denominator} &= (0.2381x^2 + 1.1111)x^2 + 1.
\end{aligned}
$$


Since additions and subtractions are generally much faster than multiplications or
divisions, we generally neglect them in a count of operations. We have then three
multiplications for the numerator, two for the denominator, plus one to get $x^2$, and one
division, for a total of seven operations. The Maclaurin series is evaluated with six
multiplications, using the nested form. If division and multiplication consume about the
same time, there is about a standoff in effort, but greater precision for Eq. (10.19).*

Since small differences in effort accumulate for a frequently used function, it is of
interest to see if we can further decrease the number of operations to evaluate Eq. (10.19).
By means of a succession of divisions we can re-express it in continued-fraction form:

$$
\begin{aligned}
\frac{0.0677x^5 + 0.7778x^3 + x}{0.2381x^4 + 1.1111x^2 + 1}
&= \frac{0.2844x^5 + 3.2667x^3 + 4.2x}{x^4 + 4.6667x^2 + 4.2}\\
&= \frac{0.2844x(x^4 + 11.4846x^2 + 14.7659)}{x^4 + 4.6667x^2 + 4.2}\\
&= \frac{0.2844x}{(x^4 + 4.6667x^2 + 4.2)/(x^4 + 11.4846x^2 + 14.7659)}\\
&= \frac{0.2844x}{1 - (6.8179x^2 + 10.5659)/(x^4 + 11.4846x^2 + 14.7659)}\\
&= \frac{0.2844x}{1 - 6.8179(x^2 + 1.5497)/(x^4 + 11.4846x^2 + 14.7659)}\\
&= \frac{0.2844x}{1 - 6.8179/[(x^4 + 11.4846x^2 + 14.7659)/(x^2 + 1.5497)]}\\
&= \frac{0.2844x}{1 - 6.8179/[x^2 + 9.9348 - 0.6304/(x^2 + 1.5497)]}.
\end{aligned}
$$

In this last form, we see that three divisions and two multiplications are needed (one
multiplication by $x$ and one to get $x^2$), for a total of five operations. We have saved two
steps. In most cases there is an even greater advantage to the continued-fraction form;
in this example the missing powers of x favored the evaluation as polynomials.
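The two evaluation schemes can be compared side by side. This is a minimal Python sketch added for illustration; the function names are hypothetical, and the constants are the four-decimal values printed in the text, so the two forms are expected to agree only to about three or four decimal places.

```python
def pade_nested(x):
    """Eq. (10.19) in nested form: six multiplications and one division."""
    x2 = x * x
    numerator = ((0.0677 * x2 + 0.7778) * x2 + 1.0) * x
    denominator = (0.2381 * x2 + 1.1111) * x2 + 1.0
    return numerator / denominator

def pade_continued_fraction(x):
    """Continued-fraction form: two multiplications and three divisions."""
    x2 = x * x
    return 0.2844 * x / (1.0 - 6.8179 / (x2 + 9.9348 - 0.6304 / (x2 + 1.5497)))

for x in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(x, pade_nested(x), pade_continued_fraction(x))
```

The small disagreement between the two columns comes from rounding the constants, not from the algebra; carrying more digits in the constants of the continued fraction would bring the forms closer together.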


*On many computers, division is slower than multiplication. This will modify the conclusion reached here.

The error of a Padé approximation can often be roughly estimated by computing the
next nonzero term in the numerator of Eq. (10.18). For the above example, the coefficient
of $x^{10}$ is zero, and the next term is

$$
\left(-\frac{1}{7}b_4 + \frac{1}{9}b_2 - \frac{1}{11}\right)x^{11}
= \left[-\frac{1}{7}\left(\frac{5}{21}\right) + \frac{1}{9}\left(\frac{10}{9}\right) - \frac{1}{11}\right]x^{11}
= -0.0014x^{11}.
$$

Dividing by the denominator, we have

$$
\text{Error} \approx \frac{-0.0014x^{11}}{1 + 1.1111x^2 + 0.2381x^4}.
$$

At x = 1 this estimate gives -0.00060, which is about three times too large, but still of
the correct order of magnitude. It is not unusual that such estimates be rough; analogous
estimates of error by using the next term in a Maclaurin series behave similarly. The
validity of the rule of thumb that “next term approximates the error” is poor when the
coefficients do not decrease rapidly.

The preference for Padé approximations with the degree of the numerator the same 
as or one more than the degree of the denominator rests on the empirical fact that the 
errors are usually less for these. Ralston (1965) gives examples demonstrating this. 

One can get somewhat improved rational-function approximations by starting with 
the Chebyshev expansion and operating analogously to the method for Padé approxi- 
mations. We illustrate with an approximation for e*. The Chebyshev series was derived 
in Section 10.4, Eq. (10.13): 


$$
e^x \approx 1.2661T_0 + 1.1303T_1 + 0.2715T_2 + 0.0444T_3.
$$

Using this approximation, we form the difference

$$
f(x) - \frac{P_n(x)}{Q_m(x)} = \frac{(1.2661 + 1.1303T_1 + 0.2715T_2 + 0.0444T_3)(1 + b_1T_1) - (a_0 + a_1T_1 + a_2T_2)}{1 + b_1T_1}.
$$

Here we have chosen the numerator as a second-degree Chebyshev polynomial and the
denominator as first degree. We again make the first $N = n + m$ powers of $x$ in the
numerator vanish. Expanding the numerator, we get

$$
\begin{aligned}
\text{Numerator} = {}& 1.2661 + 1.1303T_1 + 0.2715T_2 + 0.0444T_3 + 1.2661b_1T_1\\
&+ 1.1303b_1T_1^2 + 0.2715b_1T_1T_2 + 0.0444b_1T_1T_3 - a_0 - a_1T_1 - a_2T_2.
\end{aligned}
$$

Before we can equate coefficients to zero, we need to resolve the products of Chebyshev
polynomials that occur. Recalling that $T_n(x) = \cos n\theta$, we can use the trigonometric
identity

$$
\begin{aligned}
\cos n\theta\,\cos m\theta &= \tfrac{1}{2}[\cos(n + m)\theta + \cos(n - m)\theta],\\
T_n(x)\,T_m(x) &= \tfrac{1}{2}[T_{n+m}(x) + T_{|n-m|}(x)].
\end{aligned}
$$

The absolute value of the difference $n - m$ occurs because $\cos(z) = \cos(-z)$.
Using this relation we can write the equations

$$
\begin{aligned}
a_0 &= 1.2661 + \frac{1.1303}{2}\,b_1,\\
a_1 &= 1.1303 + \left(\frac{0.2715}{2} + 1.2661\right)b_1,\\
a_2 &= 0.2715 + \left(\frac{1.1303}{2} + \frac{0.0444}{2}\right)b_1,\\
0 &= 0.0444 + \frac{0.2715}{2}\,b_1.
\end{aligned}
$$

Solving, we get $b_1 = -0.3266$, $a_0 = 1.0815$, $a_1 = 0.6724$, $a_2 = 0.07966$, and

$$
\begin{aligned}
e^x &\approx \frac{1.0815 + 0.6724T_1 + 0.07966T_2}{1 - 0.3266T_1},\\
e^x &\approx \frac{1.0018 + 0.6724x + 0.1593x^2}{1 - 0.3266x}.
\end{aligned}
\tag{10.20}
$$


The last expression results when the Chebyshev polynomials are written in terms of 
powers of x. In Table 10.7 the error of this rational approximation is compared to the 
Chebyshev expansion. We see that the maximum error is reduced by 22%. Note that we 
do not, nevertheless, yet have a “best approximation.” The error should reach equal 
maximums at five points in the interval—instead the error is large near x = 1 and too 
small elsewhere. 
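The four equations for $a_0$, $a_1$, $a_2$, and $b_1$ can be solved in a few lines. The sketch below is an added illustration in Python and starts from the four-decimal Chebyshev coefficients of Eq. (10.13); that rounding is an assumption of the sketch, and with it $b_1$ comes out near $-0.3271$ rather than the $-0.3266$ of the text, which presumably was computed from coefficients carried to more digits. The maximum error is estimated on a grid of 2001 points.

```python
import math

# Chebyshev coefficients of e^x, rounded as in Eq. (10.13).
c0, c1, c2, c3 = 1.2661, 1.1303, 0.2715, 0.0444

# The T_3 coefficient of the numerator must vanish: c3 + (c2/2) b1 = 0.
b1 = -2 * c3 / c2
# The T_0, T_1, T_2 coefficients then fix the numerator.
a0 = c0 + (c1 / 2) * b1
a1 = c1 + (c2 / 2 + c0) * b1
a2 = c2 + (c1 / 2 + c3 / 2) * b1

def rational(x):
    """(a0 + a1 T_1 + a2 T_2) / (1 + b1 T_1) with T_1 = x, T_2 = 2x^2 - 1."""
    t1, t2 = x, 2 * x * x - 1
    return (a0 + a1 * t1 + a2 * t2) / (1 + b1 * t1)

grid = [-1 + i / 1000 for i in range(2001)]
max_error = max(abs(math.exp(x) - rational(x)) for x in grid)

print(b1, a0, a1, a2)
print(max_error)
```

The estimated maximum error should be close to the 0.0047 found at x = 1 in Table 10.7 and smaller than the 0.0060 of the Chebyshev series.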

The basis for this last statement is the minimax theorem. Based on a theorem due
to Chebyshev, we may state a principle whereby we may determine whether the
approximation represented by a given polynomial or rational function is optimum, in the sense
that it gives the least maximum error of any rational function of the same degree of
numerator and denominator on a given interval. An expression is “minimax” if and only
if there are at least $n + 2$ maxima in the deviations, and these are all equal in magnitude
and of alternating sign, on the interval of approximation. (Here $n$ is the sum of the degrees
of numerator and denominator of the rational function.) In the discussion that follows,
we shall be referring to the magnitude of the errors and will not be concerned about their
sign.


Table 10.7 Comparison of rational approximations (Eq. (10.20)) with Chebyshev series for $e^x$

                                           Rational
   x       e^x       Chebyshev   Error     function   Error
 -1.0     0.3679     0.3629       0.0050   0.3684    -0.0005
 -0.8     0.4493     0.4535      -0.0042   0.4486     0.0007
 -0.6     0.5488     0.5535      -0.0047   0.5482     0.0006
 -0.4     0.6703     0.6713      -0.0010   0.6707    -0.0004
 -0.2     0.8187     0.8155       0.0032   0.8201    -0.0014
  0       1.0000     0.9946       0.0054   1.0018    -0.0018
  0.2     1.2214     1.2172       0.0042             -0.0011
  0.4     1.4918     1.4917       0.0001              0.0007
  0.6     1.8221     1.8267      -0.0046              0.0030
  0.8     2.2255     2.2307      -0.0052              0.0031
  1.0     2.7183     2.7123       0.0060             -0.0047

A second important consequence of this principle is that we can put bounds on the 
error of the minimax expression from the range of the errors of a function that is not 
minimax. Suppose we have an expression like Eq. (10.20). It has five maxima of alternating
sign on the interval $[-1, 1]$, as shown by Table 10.7. This is the correct number,
but we know the function isn’t minimax because the maxima aren’t equal in magnitude. 
While some of the maxima are not given precisely by the table, we can see that the 
smallest is 0.0005 (at x = —1) and the largest appears to be 0.0047 (at x = 1). From 
this range of maximum error values, we can bound the maximum error of the minimax 
rational function [of degree (2, 1)]: the minimax expression will have a maximum error 
on [—1, 1] no less than 0.0005 and no greater than 0.0047. 

Similarly, the truncated Chebyshev series of degree 3 whose errors are tabulated in 
Table 10.4 is not truly minimax; it has five alternating maxima to its errors but they are 
not quite equal in magnitude. From an examination of Table 10.4 we can say, however, 
that the minimax polynomial of degree 3 will have a maximum error bounded by 0.0047 
and 0.0060. (A tighter bound might result from a more careful computation of the errors
of Eq. (10.13).) Such a prediction about the amount of improvement that will be provided
by a minimax expression can help one decide whether the additional effort to find it is 
worthwhile. 

To obtain the optimum rational function that approximates the function with equal- 
magnitude errors distributed through the interval is beyond the scope of this text. The 
approach that is used is to improve an initial estimate of a function, such as Eq. (10.20), 
by successive trials, often modifying the constants on the basis of experience until even- 
tually one has a satisfactory formula. Systematic methods of determining the constants 
in such minimax rational approximations have also been determined. They are iteration 
methods beginning from an initial “sufficiently good” approximation. They are expensive 
to compute because the iterations involve solving a set of nonlinear equations. Ralston 
(1965) describes one such method. Prenter (1975) discusses the approximation of functions 
of several variables. 
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10.6 APPROXIMATION OF FUNCTIONS WITH TRIGONOMETRIC 
SERIES: FAST FOURIER TRANSFORMS 


In nearly all of our discussions until now we have approximated functions by polynomials, 
but for many functions it is better to use trigonometric functions as the basis. This is 
particularly true when f(x) is periodic or when it has discontinuities. A trigonometric 
series, also called a Fourier series, is another means for solving some partial-differential 
equations. Getting the coefficients of a trigonometric or Fourier series can be computa- 
tionally expensive, but the fast Fourier transform (FFT) is a way to minimize the effort. 
In this section, we will first describe the Fourier series and then develop the FFT procedure. 

Recall what is meant by a periodic function; $f(x)$ is periodic with the period $T$ if
$f(x) = f(x + T)$.


EXAMPLES 1. $\sin(x)$ and $\cos(x)$ are periodic with period $2\pi$, since $\sin(x) = \sin(x + 2\pi)$
and $\cos(x) = \cos(x + 2\pi)$.

2. $\sin(4x)$ is periodic with period $\pi/2$.

3. This function has a period of 2: [graph]

4. This function has a period of 10: [graph; horizontal axis marked 0, 10, 20, 30]


It will be convenient to assume that the period $T$ is always $2\pi$ in the following.
There is no loss of generality in this because, if $f(x) = f(x + T)$, we can define
$g(x) = f(xT/2\pi)$, where $g(x)$ has period $2\pi$.

If $f(x)$ is periodic of period $2\pi$ and obeys certain assumptions, we can represent it
as a Fourier series:


$$
f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty}\,[a_k \cos(kx) + b_k \sin(kx)]. \tag{10.21}
$$

An alternative form of this series uses complex exponentials. Recall that a complex
number $z$ can be written as $a + bi$, where $a$ and $b$ are real and $i = \sqrt{-1}$, so $i^2 = -1$.
Euler's identity relates sines and cosines to the exponential form:

$$
e^{ijx} = \cos(jx) + i\sin(jx),
$$

and this permits us to write Eq. (10.21) as

$$
\begin{aligned}
f(x) &= c_0 + \sum_{j=1}^{\infty}\,(c_j e^{ijx} + c_{-j} e^{-ijx})\\
&= c_0 + \sum_{j=1}^{\infty}\,[(c_j + c_{-j})\cos(jx) + i(c_j - c_{-j})\sin(jx)]\\
&= \sum_{j=-\infty}^{\infty} c_j e^{ijx}.
\end{aligned}
\tag{10.22}
$$

We can match up the $a$'s and $b$'s of Eq. (10.21) to the $c$'s of (10.22):

$$
a_j = c_j + c_{-j}, \qquad b_j = i(c_j - c_{-j}), \qquad
c_j = \frac{a_j - ib_j}{2}, \qquad c_{-j} = \frac{a_j + ib_j}{2}. \tag{10.23}
$$

For $f(x)$ real, it is easy to show that $c_0 = \bar{c}_0$ and $c_j = \bar{c}_{-j}$, where the bars represent
complex conjugates.
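For integers $j$ and $k$, it is true that

$$
\int_0^{2\pi} (e^{ikx})(e^{ijx})\,dx = \int_0^{2\pi} e^{i(k+j)x}\,dx =
\begin{cases}
0 & \text{for } k \ne -j,\\
2\pi & \text{for } k = -j.
\end{cases}
$$

(You can verify the first of these through Euler's identity.) This allows us to evaluate the
$c$'s of Eq. (10.22) by the following.
For each fixed $k$, we get

$$
\begin{aligned}
f(x)\,e^{-ikx} &= \sum_{j=-\infty}^{\infty} c_j e^{i(j-k)x},\\
\int_0^{2\pi} f(x)\,e^{-ikx}\,dx &= 2\pi c_k, \quad \text{or}\\
c_k &= \frac{1}{2\pi}\int_0^{2\pi} f(x)\,e^{-ikx}\,dx, \qquad k = 0, \pm 1, \pm 2, \ldots.
\end{aligned}
\tag{10.24}
$$


EXAMPLES (You should verify each of these.)

1. Let $f(x) = x$; then

$$
c_k = \frac{1}{2\pi}\int_0^{2\pi} x\,e^{-ikx}\,dx = \frac{i}{k}, \qquad k \ne 0.
$$

2. Let $f(x) = x(2\pi - x)$; then

$$
c_k = \frac{1}{2\pi}\int_0^{2\pi} x(2\pi - x)\,e^{-ikx}\,dx = -\frac{2}{k^2}, \qquad k \ne 0.
$$

3. Let $f(x) = \cos(x)$; then

$$
c_k = \frac{1}{2\pi}\int_0^{2\pi} \cos(x)\,e^{-ikx}\,dx =
\begin{cases}
\tfrac{1}{2} & \text{for } k = 1 \text{ or } -1,\\
0 & \text{for all other } k.
\end{cases}
$$

Note that for Eq. (10.21) this makes $a_1 = 1$ and all the other $a_j$'s $= 0$. Thus, for a
given $f(x)$ that satisfies continuity conditions, we have

$$
c_j = \frac{1}{2\pi}\int_0^{2\pi} f(x)\,e^{-ijx}\,dx, \qquad j = 0, \pm 1, \pm 2, \ldots.
$$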

The magnitudes of the Fourier series coefficients |c;| are the power spectrum of f; 
these show the frequencies that are represented in f(x). If we know f(x) in the time { 
domain, we can identify f by computing the c;’s. In getting the Fourier series, we have { 
transformed from the time domain to the frequency domain, an important aspect of wave t 


analysis. 
Suppose we have N values for f(x) on the interval [0, 27] at equispaced points, | 
Xx, = 2mk/N,k=0,1,..., N — 1. Since f(x) is periodic, fy = fo, fyv+1 = fi, and so 4 


on. Instead of formal analytical integration, we would use a numerical integration method. 4 
to get the coefficients. Even if f(x) is known at all points in [0, 277], we might prefer to 
use numerical integration. This would use only certain values of f(x), often those evaluated { 
at uniform intervals. It is also often true that we do not know f(x) everywhere, because 
we have sampled a continuous signal. In that case, however, it is better to use the discrete 
Fourier transform, which can be defined as 


X(n) = BY xolk) Pm, ; (10.25) 


In (10.25), we have changed notation to conform more closely to the literature on
FFT. X(n) corresponds to the coefficients of N frequency terms, and the $x_0(k)$ are the N
values of the signal samples in the time domain. You can think of n as indexing the
X-terms and k as indexing the $x_0$-terms. Equation (10.25) corresponds to a set of N linear
equations that we can solve for the unknown X(n). Since the unknowns appear on the
left-hand side of (10.25), this requires only the multiplication of an N-component vector
by an N × N matrix.
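As a reference point for the FFT, the matrix-vector product implied by Eq. (10.25) can be written directly. This is a minimal sketch; the name `dft_direct` is an illustrative assumption, and the kernel is written as powers of $e^{-i2\pi/N}$, the exponential that appears in (10.25).

```python
import cmath


def dft_direct(x):
    """Evaluate X(n) = sum_k x(k) * W**(n*k), with W = exp(-2*pi*i/N), for n = 0..N-1.

    This is the plain N-by-N matrix-vector product: N*N complex multiplications.
    """
    n_pts = len(x)
    w = cmath.exp(-2j * cmath.pi / n_pts)
    return [sum(x[k] * w ** (n * k) for k in range(n_pts)) for n in range(n_pts)]


impulse_spectrum = dft_direct([1.0, 0.0, 0.0, 0.0])    # every X(n) equals 1
constant_spectrum = dft_direct([1.0, 1.0, 1.0, 1.0])   # only X(0) is nonzero
```

The double loop makes the operation count visible: each of the N outputs needs N multiplications, which is the cost the FFT is designed to reduce.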

It will simplify the notation if we let $W = e^{-i2\pi/N}$, making the right-hand-side terms
of Eq. (10.25) become $x_0(k)W^{nk}$. To develop the FFT algorithm, suppose that N = 4.
We write the four equations for this case: 


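$$\begin{aligned}
X(0) &= W^0 x_0(0) + W^0 x_0(1) + W^0 x_0(2) + W^0 x_0(3),\\
X(1) &= W^0 x_0(0) + W^1 x_0(1) + W^2 x_0(2) + W^3 x_0(3),\\
X(2) &= W^0 x_0(0) + W^2 x_0(1) + W^4 x_0(2) + W^6 x_0(3),\\
X(3) &= W^0 x_0(0) + W^3 x_0(1) + W^6 x_0(2) + W^9 x_0(3).
\end{aligned}$$

In matrix form:

$$\begin{bmatrix} X(0)\\ X(1)\\ X(2)\\ X(3) \end{bmatrix} =
\begin{bmatrix}
W^0 & W^0 & W^0 & W^0\\
W^0 & W^1 & W^2 & W^3\\
W^0 & W^2 & W^4 & W^6\\
W^0 & W^3 & W^6 & W^9
\end{bmatrix}
\begin{bmatrix} x_0(0)\\ x_0(1)\\ x_0(2)\\ x_0(3) \end{bmatrix}. \tag{10.26}$$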


In solving the set of N equations in the form of Eq. (10.26), we will have to make
$N^2$ complex multiplications plus N(N − 1) complex additions. The value of the FFT is
that the number of such operations is greatly reduced. While there are several variations
on the algorithm, we will concentrate on the Cooley-Tukey formulation. The matrix of
Eq. (10.26) can be factored to give an equivalent form for the set of equations. At the
same time we will use the fact that $W^0 = 1$ and $W^k = W^{k \bmod N}$.

$$\begin{bmatrix} X(0)\\ X(2)\\ X(1)\\ X(3) \end{bmatrix} =
\begin{bmatrix}
1 & W^0 & 0 & 0\\
1 & W^2 & 0 & 0\\
0 & 0 & 1 & W^1\\
0 & 0 & 1 & W^3
\end{bmatrix}
\begin{bmatrix}
1 & 0 & W^0 & 0\\
0 & 1 & 0 & W^0\\
1 & 0 & W^2 & 0\\
0 & 1 & 0 & W^2
\end{bmatrix}
\begin{bmatrix} x_0(0)\\ x_0(1)\\ x_0(2)\\ x_0(3) \end{bmatrix}. \tag{10.27}$$


You should verify that the factored form (10.27) is exactly equivalent to Eq. (10.26) 
by multiplying out. Note carefully that the elements of the X vector are scrambled. (The 
development can be done formally and more generally by representing n and k as binary 
values, but it will suffice to show the basis for the FFT algorithm by expanding on this 
simple N = 4 case.) 

By using the factored form, we now get the values of X(n) by two steps (stages), in
each of which we multiply a matrix times a vector. In the first stage, we transform $x_0$
into $x_1$ by multiplying the right matrix of (10.27) and $x_0$. In the second stage, we multiply
the left matrix and $x_1$, getting $x_2$. We get X by unscrambling the components of $x_2$. By
doing the operation in stages, the number of complex multiplications is reduced to
$N \log_2 N$. For N = 4, this is a reduction by one-half (8 instead of 16), but for large N it is
very significant; if N = 1024, there are 10 stages and the reduction in complex multiplies
is a hundredfold!
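The two stages for N = 4 can be written out term by term and compared with the direct sums of Eq. (10.26). This is a minimal sketch; the function names are illustrative assumptions, and the stage formulas are read off the two matrices of Eq. (10.27).

```python
import cmath


def dft4_two_stage(x0):
    """Apply the two matrices of Eq. (10.27) in turn, then unscramble."""
    w = cmath.exp(-2j * cmath.pi / 4)
    # Stage 1: right-hand matrix of (10.27), powers 0, 0, 2, 2.
    x1 = [x0[0] + w ** 0 * x0[2],
          x0[1] + w ** 0 * x0[3],
          x0[0] + w ** 2 * x0[2],
          x0[1] + w ** 2 * x0[3]]
    # Stage 2: left-hand matrix of (10.27), powers 0, 2, 1, 3.
    x2 = [x1[0] + w ** 0 * x1[1],
          x1[0] + w ** 2 * x1[1],
          x1[2] + w ** 1 * x1[3],
          x1[2] + w ** 3 * x1[3]]
    # x2 holds X(0), X(2), X(1), X(3); put the results in natural order.
    return [x2[0], x2[2], x2[1], x2[3]]


def dft4_direct(x0):
    """Direct evaluation of the four sums of Eq. (10.26)."""
    w = cmath.exp(-2j * cmath.pi / 4)
    return [sum(x0[k] * w ** (n * k) for k in range(4)) for n in range(4)]


staged = dft4_two_stage([1.0, 2.0, 3.0, 4.0])
direct = dft4_direct([1.0, 2.0, 3.0, 4.0])
```

Each stage uses one multiplication per output (4 per stage, 8 in all), against 16 for the direct form, which is the factor of one-half quoted above.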

It is convenient to represent the sequence of multiplications of the factored form (Eq.
(10.27) or its equivalent for larger N) by flow diagrams. Figure 10.8 is for N = 4 and
Fig. 10.9 is for N = 16. Each column holds values of $x_{ST}$, where the subscript tells which
stage is being computed; ST ranges from 1 to 2 for N = 4 and from 1 to 4 for N = 16.
(The number of stages, for N a power of 2, is $\log_2(N)$.) In each stage, we get x-values
of the next stage from those of the present stage. Every new x-value is the sum of the
two x-values from the previous stage that connect to it, with one of these multiplied by
a power of W. The diagram tells which $x_{ST}$ terms are combined to give an $x_{ST+1}$ term,
and the numbers shown within the lines are the powers of W that are used. For example,
looking at Fig. 10.9, we see that

$$\begin{aligned}
x_2(6) &= x_1(2) + W^8 x_1(6),\\
x_3(13) &= x_2(13) + W^6 x_2(15),\\
x_4(9) &= x_3(8) + W^9 x_3(9),
\end{aligned}$$

and so on.




Figure 10.8  Flow diagram for N = 4.


The last columns in Figs. 10.8 and 10.9 indicate how the final x-values are unscrambled
to give the X-values. This relationship can be found by expressing the index k of x
in the last stage as a binary number and reversing the bits; this gives n in X(n). For
example, in Fig. 10.9, we see that $x_4(3) = X(12)$ and $x_4(11) = X(13)$. From the
bit-reversing rule, we get

$$3 = 0011_2 \rightarrow 1100_2 = 12, \qquad 11 = 1011_2 \rightarrow 1101_2 = 13.$$
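The bit-reversing rule is easy to express with shifts and masks. This is a minimal sketch; the name `bit_reverse` and its argument names are illustrative assumptions.

```python
def bit_reverse(k, bits):
    """Reverse the lowest `bits` binary digits of the nonnegative integer k."""
    reversed_k = 0
    for _ in range(bits):
        reversed_k = (reversed_k << 1) | (k & 1)   # move the low bit of k onto the result
        k >>= 1
    return reversed_k


# For N = 16 there are log2(16) = 4 bits per index.
unscramble_16 = [bit_reverse(k, 4) for k in range(16)]
```

Here `unscramble_16[3]` is 12 and `unscramble_16[11]` is 13, matching the two cases read from Fig. 10.9.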


Observe also that the bit-reversing rule can give the powers of W that are involved
in computing the next stage. For the last stage, the powers are identical to the numbers
obtained by bit reversal. At each previous stage, however, only the first half of the powers
are employed, but each power is used twice as often. It is of interest to see how we can
generate these values. Computer languages that facilitate bit manipulations make this an
easy job, but there is a good alternative. Observe how the powers in Fig. 10.8 differ
from those in Fig. 10.9 and how they progress from stage to stage. The following table
pinpoints this.


Stage    N = 4       N = 16
  1:     0 0 2 2     0  0  0  0  0  0  0  0  8  8  8  8  8  8  8  8
  2:     0 2 1 3     0  0  0  0  8  8  8  8  4  4  4  4 12 12 12 12
  3:                 0  0  8  8  4  4 12 12  2  2 10 10  6  6 14 14
  4:                 0  8  4 12  2 10  6 14  1  9  5 13  3 11  7 15


Can you see what a similar table for N = 2 would look like? Its single row would 
be 0 1. Now we see that the row of powers for the last stage can be divided into two 
halves with the numbers in the second half always 1 greater than the corresponding entry 
in the first half. The row above is the left half of the current row with each value repeated. 
This observation leads to the following algorithm. 



Algorithm to Generate Powers of W in FFT

For N a power of 2, let $Q = \log_2(N)$.

Initialize an array P of length N to all zeros.
Let ST = 1.
Repeat
    Double the values of P(K) for K = 1 .. $2^{ST-1} - 1$,
    Let each P(K + $2^{ST-1}$) = P(K) + 1 for K = 0 .. $2^{ST-1} - 1$,
    Increment ST,
Until ST > Q.

The successive new values for powers of W are now in array P.


EXAMPLE  Use the algorithm to generate the powers of W for N = 8:

$Q = \log_2(8) = 3$

K:                 0  1  2  3  4  5  6  7
Initial P array:   0  0  0  0  0  0  0  0
ST = 1, doubled:   0  0  0  0  0  0  0  0
        add 1:     0  1  0  0  0  0  0  0
ST = 2, doubled:   0  2  0  0  0  0  0  0
        add 1:     0  2  1  3  0  0  0  0
ST = 3, doubled:   0  4  2  6  0  0  0  0
        add 1:     0  4  2  6  1  5  3  7

The last row of values corresponds to the bits of 000 to 111 after reversal.
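The doubling-and-adding procedure translates almost line for line into code. This is a minimal sketch; the name `fft_powers` is an illustrative assumption, and the array is indexed from 0 as in the example.

```python
def fft_powers(n):
    """Return the last-stage powers of W for an N-point FFT, N a power of 2.

    Follows the algorithm in the text: at stage ST the first 2**(ST-1) entries
    are doubled, then each is copied, plus 1, into the next 2**(ST-1) entries.
    """
    q = n.bit_length() - 1
    if n < 1 or (1 << q) != n:
        raise ValueError("n must be a power of 2")
    p = [0] * n
    for st in range(1, q + 1):
        half = 1 << (st - 1)
        for k in range(half):
            p[k] *= 2                 # doubling step (P(0) stays 0)
        for k in range(half):
            p[k + half] = p[k] + 1    # add-1 step fills the second half
    return p


powers_8 = fft_powers(8)
powers_16 = fft_powers(16)
```

For N = 8 the result is 0 4 2 6 1 5 3 7, and for N = 16 it reproduces the stage-4 row of the table of powers.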

Our discussion has assumed that N is a power of 2; for this case the economy of the
FFT is a maximum. When N is not a power of 2 but can be factored, there are adaptations
of the general idea that reduce the number of operations, but they are more than N log(N).
See Brigham (1974) for a discussion of this as well as a fuller treatment of the theory
behind FFT.

More recently there has been interest in another transform, called the discrete Hartley
transform. A discussion of this transform would parallel our discussion of the Fourier
transform. Moreover, it has been shown that this transform can be converted into a Fast
Hartley Transform (FHT) that reduces to $N \log_2(N)$ computations. For a full coverage of
the FHT, one should consult Bracewell (1986). One advantage of the FHT is that it is
usually faster than the FFT. Moreover, it is easy to compute the FFT from the Hartley
transform. However, the main power of the FHT is that all the computations are done
in real arithmetic, so that one can use a language like PASCAL that does not have a
complex data type. An interesting and easy introduction to the FHT is found in O'Neill
(1988).
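To illustrate the remark that the Fourier transform is easy to obtain from the Hartley transform, here is a minimal sketch for a real signal. It is not a fast algorithm (the sums are evaluated directly), and it assumes the unnormalized convention H(n) = Σ x(k)[cos(2πnk/N) + sin(2πnk/N)]; Bracewell's own definition may differ by a scale factor. The function names are illustrative.

```python
import math


def dht(x):
    """Discrete Hartley transform by direct summation, in real arithmetic only."""
    n_pts = len(x)
    out = []
    for n in range(n_pts):
        total = 0.0
        for k in range(n_pts):
            angle = 2.0 * math.pi * n * k / n_pts
            total += x[k] * (math.cos(angle) + math.sin(angle))
        out.append(total)
    return out


def dft_from_dht(h):
    """Recover X(n) = sum_k x(k) e^{-2 pi i n k / N} of a real signal from its DHT."""
    n_pts = len(h)
    result = []
    for n in range(n_pts):
        h_mirror = h[(n_pts - n) % n_pts]     # H(N - n), with H(N) taken as H(0)
        cos_sum = 0.5 * (h[n] + h_mirror)     # equals sum_k x(k) cos(2 pi n k / N)
        sin_sum = 0.5 * (h[n] - h_mirror)     # equals sum_k x(k) sin(2 pi n k / N)
        result.append(complex(cos_sum, -sin_sum))
    return result


spectrum = dft_from_dht(dht([1.0, 2.0, 3.0, 4.0]))
```

Only the final packing step uses a complex type; the transform itself needs nothing beyond real sums, which is the point made about languages without complex arithmetic.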


10.7  CHAPTER SUMMARY


Here’s what you are able to do if you understand the topics of the chapter: 


1. Explain what is meant by the least-squares criterion and use it to derive the normal 
equations for the coefficients of a least-squares line. 


2. Apply the least-squares criterion to find the coefficients of a polynomial that fits a 
set of data points. You should be able to show how the design matrix is related to 
the normal equations. 


3. Get the coefficients of other functions than polynomials by the least-squares procedure 
and recognize why this is more difficult. 


4. Explain how to get a Chebyshev polynomial of degree n, what is meant by a set of
orthogonal functions, and what is the most important property of the Chebyshev
polynomials.


5. Use Chebyshev polynomials to economize a power series and explain why such
approximations are preferred in computers.


6. Obtain the coefficients of a Padé approximation and describe how one can get a
further improvement in efficiency. You can explain the minimax principle.


7. Tell how the coefficients of a trigonometric series can be evaluated and outline the
FFT method for fitting to a set of equispaced data.


8. Use computer programs like those of this chapter. 


SELECTED READINGS FOR CHAPTER 10 


Conte and de Boor (1980); Fike (1968); Brigham (1974); Ramirez (1985). 


COMPUTER PROGRAMS 


Three computer programs are presented. Program 1 (Fig. 10.10) finds least-squares
polynomials that fit a set of x,y-pairs. The program reads in the coordinates of the points
and computes the coefficients of the normal equations for polynomials whose degree 
ranges from MS to MF. Values of these parameters are also read in. To get these normal 
equations, the terms of the largest augmented matrix (Eq. (10.5)) are computed. (This 
matrix corresponds to degree MF). In the process, the powers of x are stored in a vector; 
this expedites forming the sums. The program recognizes the symmetry of the coefficient 
matrix, which lessens the need to recompute terms. 
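The way Program 1 builds the augmented matrix can be sketched compactly. This is an illustrative version, not a translation of the listing: the function name `normal_equations` is an assumption, but like the program it keeps a running vector of powers of x and fills the coefficient matrix from the power sums, since entry (i, j) depends only on i + j.

```python
def normal_equations(x, y, degree):
    """Return the augmented normal-equation matrix for a least-squares polynomial.

    Row i, column j holds the sum of x**(i+j); the last column holds sum of y * x**i.
    """
    m = degree + 1
    xn = [1.0] * len(x)                        # running powers of x, starting at x**0
    power_sums = []
    rhs = []
    for p in range(2 * degree + 1):
        power_sums.append(sum(xn))
        if p < m:
            rhs.append(sum(yj * xj for yj, xj in zip(y, xn)))
        xn = [v * xj for v, xj in zip(xn, x)]  # advance every entry to the next power
    return [[power_sums[i + j] for j in range(m)] + [rhs[i]] for i in range(m)]


# Data taken from the DATA statements of Program 1.
xs = [0.05, 0.11, 0.15, 0.31, 0.46, 0.52, 0.7, 0.74, 0.82, 0.98, 1.17]
ys = [0.956, 0.89, 0.832, 0.717, 0.571, 0.539, 0.378, 0.37, 0.306, 0.242, 0.104]
aug = normal_equations(xs, ys, 2)
```

The first row begins 11, 6.01, 4.6545, in agreement with the leading entries printed in the program output.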

After the matrix that represents the normal equations has been formed, a subroutine 
is called to compute the LU equivalent. This subroutine is very similar to one in Chapter 
2, except that pivoting is not done. The solution for the coefficients of each of the 
polynomials is obtained by calling a second subroutine. Using the LU decomposition 
process avoids triangularizing the normal equations repeatedly for the various degrees of 
polynomial. The values of Eq. (10.6), here called BETA, are computed and printed to 
      PROGRAM LEASQOR (INPUT, OUTPUT)
C
C     THIS PROGRAM IS USED IN FITTING A POLYNOMIAL TO A SET OF DATA.
C     THE PROGRAM READS IN N PAIRS OF X AND Y VALUES AND COMPUTES THE
C     COEFFICIENTS OF THE NORMAL EQUATIONS FOR THE LEAST SQUARES
C     METHOD.
C
C     PARAMETERS ARE :
C
C     X,Y   - ARRAY OF X AND Y VALUES
C     N     - NUMBER OF DATA PAIRS
C     MS,MF - THE RANGE OF DEGREE OF POLYNOMIALS TO BE COMPUTED
C             THE MAXIMUM DEGREE IS 9.
C     A     - AUGMENTED ARRAY OF THE COEFFICIENTS OF THE NORMAL
C             EQUATIONS.
C     C     - ARRAY OF COEFFICIENTS OF THE LEAST SQUARES
C             POLYNOMIALS.
C
      REAL X(100),Y(100),C(100),A(10,11),XN(100),SUM,BETA
      INTEGER N,MS,MF,MFP1,MFP2,I,J,IM1,IPT,ICOEF,JCOEF
C
C     -----------------------------------------------------------
C
C     READ IN N, THEN THE X AND Y VALUES.
C
C     READ *, N,( X(I),Y(I), I = 1,N )
C
      DATA N/11/
      DATA X/0.05,0.11,0.15,0.31,0.46,0.52,0.7,0.74,0.82,
     *       0.98,1.17,89*0.0/
      DATA Y/0.956, 0.89, 0.832, 0.717, 0.571, 0.539, 0.378,
     *       0.37, 0.306, 0.242, 0.104, 89*0.0/
C
C     READ IN MS,MF. THE PROGRAM WILL FIND COEFFICIENTS FOR EACH
C     DEGREE OF POLYNOMIAL FROM DEGREE MS TO DEGREE MF.
C
C     READ *, MS,MF
C
      DATA MS,MF/1,7/
C
C     -----------------------------------------------------------
C
C     COMPUTE MATRIX OF COEFFICIENTS AND R.H.S. FOR MF DEGREE.
C     HOWEVER, FIRST CHECK TO SEE IF MAX DEGREE REQUESTED IS TOO
C     LARGE. IT CANNOT EXCEED N-1. IF IT DOES, REDUCE TO EQUAL N-1
C     AND PRINT MESSAGE.
C
      IF ( MF .GT. (N-1) ) THEN
        MF = N - 1
        PRINT 200, MF
      END IF
    5 MFP1 = MF + 1
      MFP2 = MF + 2
C
C     -----------------------------------------------------------
C


Figure 10.10 Program 1. 




C     PUT ONES INTO A NEW ARRAY. THIS WILL HOLD THE POWERS OF
C     THE X VALUES AS WE PROCEED.
C
      DO 10 I = 1,N
        XN(I) = 1.0
   10 CONTINUE
C
C     COMPUTE FIRST COLUMN AND N+1 ST COLUMN OF A. I MOVES DOWN THE
C     ROWS, J SUMS OVER THE N VALUES.
C
      DO 30 I = 1,MFP1
        A(I,1) = 0.0
        A(I,MFP2) = 0.0
        DO 20 J = 1,N
          A(I,1) = A(I,1) + XN(J)
          A(I,MFP2) = A(I,MFP2) + Y(J)*XN(J)
          XN(J) = XN(J) * X(J)
   20   CONTINUE
   30 CONTINUE
C
C     COMPUTE THE LAST ROW OF A. I MOVES ACROSS THE COLUMNS, J
C     SUMS OVER THE N VALUES.
C
      DO 50 I = 2,MFP1
        A(MFP1,I) = 0.0
        DO 40 J = 1,N
          A(MFP1,I) = A(MFP1,I) + XN(J)
          XN(J) = XN(J) * X(J)
   40   CONTINUE
   50 CONTINUE
C
C     NOW FILL IN THE REST OF THE A MATRIX. I MOVES DOWN THE ROWS,
C     J MOVES ACROSS THE COLUMNS.
C
      DO 70 J = 2,MFP1
        DO 60 I = 1,MF
          A(I,J) = A(I+1,J-1)
   60   CONTINUE
   70 CONTINUE
C
C     WRITE OUT THE MATRIX OF NORMAL EQUATIONS.
C
      PRINT '(///)'
      PRINT *, '   THE NORMAL EQUATIONS ARE:'
      PRINT '(/)'
      PRINT 201, ((A(I,J), J=1,MFP2), I=1,MFP1)
      PRINT '(//)'
C
C     NOW CALL A SUBROUTINE TO SOLVE THE SYSTEM. DO THIS FOR EACH
C     DEGREE FROM MS TO MF. GET THE LU DECOMPOSITION OF A.
C
      CALL LUDCMQ(A,MFP1,10)
C
C     RESET THE R.H.S. INTO C. WE NEED TO DO THIS FOR EACH DEGREE.
C
      MSP1 = MS + 1
      DO 95 I = MSP1,MFP1
        DO 90 J = 1,I
          C(J) = A(J,MFP2)
   90   CONTINUE
        CALL SOLNQ(A,C,I,10)
        IM1 = I - 1
C
C     NOW WRITE OUT THE COEFFICIENTS OF THE LEAST SQUARE POLYNOMIAL.
C
        PRINT 202, IM1, ( C(J), J=1,I )
C
C     COMPUTE AND PRINT THE VALUE OF BETA = SUM OF DEV SQUARED /
C     (N-M-1).
C
        BETA = 0.0
        DO 94 IPT = 1,N
          SUM = 0.0
          DO 93 ICOEF = 2,I
            JCOEF = I - ICOEF + 2
            SUM = ( SUM + C(JCOEF) ) * X(IPT)
   93     CONTINUE
          SUM = SUM + C(1)
          BETA = BETA + ( Y(IPT) - SUM )**2
   94   CONTINUE
        BETA = BETA / (N - I)
        PRINT 203, BETA
   95 CONTINUE
C
  200 FORMAT(//' DEGREE OF POLYNOMIAL CANNOT EXCEED N - 1.',/
     +       ' REQUESTED MAXIMUM DEGREE TOO LARGE - ',
     +       'REDUCED TO ',I3)
  201 FORMAT (1X, 9F8.2)
  202 FORMAT(/' FOR DEGREE OF ',I2,' COEFFICIENTS ARE' //
     +       ' ',5X,11F9.3)
  203 FORMAT(9X,' BETA IS ',F10.5//)
      STOP
      END
      SUBROUTINE LUDCMQ(A,N,NDIM)
C
C     SUBROUTINE LUDCMQ :
C          THIS SUBROUTINE FORMS THE LU EQUIVALENT OF
C     THE SQUARE COEFFICIENT MATRIX A. THE LU, IN COMPACT FORM, IS
C     RETURNED IN THE A MATRIX SPACE. THE UPPER TRIANGULAR MATRIX U
C     HAS ONES ON ITS DIAGONAL - THESE VALUES ARE NOT INCLUDED IN
C     THE RESULT.
C
      REAL A(NDIM,NDIM),SUM
      INTEGER N,NDIM,I,J,JM1,IM1,K
C
C     (THE NEXT SIX STATEMENTS ARE MISSING FROM THE LISTING AND ARE
C     RESTORED FROM CONTEXT.)
C
      DO 30 I = 1,N
      DO 30 J = 2,N
        SUM = 0.0
        IF ( J .LE. I ) THEN
          JM1 = J - 1
          DO 10 K = 1,JM1
            SUM = SUM + A(I,K)*A(K,J)
   10     CONTINUE
          A(I,J) = A(I,J) - SUM
        ELSE
          IM1 = I - 1
          IF ( IM1 .NE. 0 ) THEN
            DO 20 K = 1,IM1
              SUM = SUM + A(I,K)*A(K,J)
   20       CONTINUE
          END IF
C
C     TEST FOR SMALL VALUE ON THE DIAGONAL
C
   25     IF ( ABS(A(I,I)) .LT. 1.0E-10 ) THEN
            PRINT 100, I
            RETURN
          ELSE
            A(I,J) = ( A(I,J) - SUM ) / A(I,I)
          END IF
        END IF
   30 CONTINUE
      RETURN
  100 FORMAT(' REDUCTION NOT COMPLETED BECAUSE SMALL VALUE',
     +       ' FOUND FOR DIVISOR IN ROW ',I3)
      END


      SUBROUTINE SOLNQ(A,B,N,NDIM)
C
C     SUBROUTINE SOLNQ :
C          THIS SUBROUTINE FINDS THE SOLUTION TO A SET
C     OF N LINEAR EQUATIONS THAT CORRESPONDS TO THE RIGHT HAND SIDE
C     VECTOR B. THE A MATRIX IS THE LU DECOMPOSITION EQUIVALENT TO THE
C     COEFFICIENT MATRIX OF THE ORIGINAL EQUATIONS, AS PRODUCED BY
C     LUDCMQ. THE SOLUTION VECTOR IS RETURNED IN THE B VECTOR.
C
      REAL A(NDIM,NDIM),B(NDIM),SUM
      INTEGER N,NDIM,I,IM1,K,J,NMJP1,NMJP2
C
C     (THE RIGHT-HAND PARTS OF THE NEXT SIX STATEMENTS ARE TRUNCATED
C     IN THE LISTING AND ARE COMPLETED FROM CONTEXT.)
C
      B(1) = B(1) / A(1,1)
      DO 20 I = 2,N
        IM1 = I - 1
        SUM = 0.0
        DO 10 K = 1,IM1
          SUM = SUM + A(I,K)*B(K)
   10   CONTINUE
        B(I) = ( B(I) - SUM ) / A(I,I)
   20 CONTINUE
C
C     NOW WE ARE READY FOR BACK SUBSTITUTION. REMEMBER THAT THE ELEMENTS
C     OF U ON THE DIAGONAL ARE ALL ONES.
C
      DO 40 J = 2,N
        NMJP2 = N - J + 2
        NMJP1 = N - J + 1
        SUM = 0.0
        DO 30 K = NMJP2,N
          SUM = SUM + A(NMJP1,K)*B(K)
   30   CONTINUE
        B(NMJP1) = B(NMJP1) - SUM
   40 CONTINUE
      RETURN
      END


OUTPUT FOR PROGRAM 1

(Entries shown as ... are illegible in the printed output.)

THE NORMAL EQUATIONS ARE:

   11.00    6.01    4.65    4.11    3.92    3.92    4.07    4.34     ...
    6.01    4.65    4.11    3.92    3.92    4.07    4.34    4.72     ...
    4.65    4.11    3.92    3.92    4.07    4.34    4.72    5.22     ...
    4.11    3.92    3.92    4.07    4.34    4.72    5.22    5.84     ...
    3.92    3.92    4.07    4.34    4.72    5.22    5.84    6.59     ...
    3.92    4.07    4.34    4.72    5.22    5.84    6.59    7.50     ...
    4.07    4.34    4.72    5.22    5.84    6.59    7.50    8.57     ...
    4.34    4.72    5.22    5.84    6.59    7.50    8.57     ...     ...


FOR DEGREE OF  1 COEFFICIENTS ARE

         .952    -.760
         BETA IS     .00102

FOR DEGREE OF  2 COEFFICIENTS ARE

         .998   -1.018     .225
         BETA IS     .00023

FOR DEGREE OF  3 COEFFICIENTS ARE

        1.004   -1.079      ...    -.069
         BETA IS     .00026

FOR DEGREE OF  4 COEFFICIENTS ARE

         .988    -.837      ...    1.046    -.456
         BETA IS     .00027

FOR DEGREE OF  5 COEFFICIENTS ARE

        1.037   -1.824    4.895  -10.753      ...   -3.659
         BETA IS     .00013

FOR DEGREE OF  6 COEFFICIENTS ARE

        1.041   -1.946    5.886  -14.081      ...   -7.599    2.112
         BETA IS     .00017

FOR DEGREE OF  7 COEFFICIENTS ARE

        1.060   -2.562   12.442  -44.820      ...  -99.706   59.449      ...
         BETA IS     .00021

      PROGRAM FFTIMSL (INPUT, OUTPUT)
C
C     PROGRAM FFTIMSL CALLS THE IMSL SUBROUTINE FFTCC TO COMPUTE
C     THE FOURIER TRANSFORM OF N DATA POINTS.
C
C     PARAMETERS ARE:
C
C     CHAT - A COMPLEX ARRAY OF LENGTH N. ON INPUT CHAT CONTAINS
C            FUNCTION VALUES, F(K), FOR K = 0 TO N-1.
C            ON OUTPUT CHAT CONTAINS THE VALUES, CHAT(J) = SUM FROM K
C            EQUALS 0 TO N-1 F(K)*EXP(2*PI*I*J*K/N).
C     N    - AN INTEGER THAT REPRESENTS THE NUMBER OF DATA POINTS.
C     IWK  - AN INTEGER VECTOR ARRAY OF LENGTH 6*N + 150.
C     WK   - A REAL WORK ARRAY OF LENGTH 6*N + 150.
C
      INTEGER N, IWK(534)
      REAL MAGN, EN, WK(534)
      COMPLEX CHAT(64)
C
      N = 64
      EN = FLOAT(N)
      TPIDIVN = 2.0*ACOS(-1.0)/EN
C
C     COMPUTE THE F(I)'S, I EQUAL 0 TO N-1.
C
      DO 5 I = 0,N-1
    5 CHAT(I+1) = CMPLX(SIN(TPIDIVN*I) + 4.0*COS(8.0*TPIDIVN*I)
     +            + SIN(10.0*TPIDIVN*I)*COS(15.0*TPIDIVN*I))


Figure 10.11 Program 2.
C
      CALL FFTCC (CHAT, N, IWK, WK)
C
      PRINT 99
   99 FORMAT (///'     J        REAL PART        IMAG. PART',
     +        '         MAGNITUDE' /
     +        ' ******    *************     *************',
     +        '     *************' /)
      DO 15 I = 1,N
        MAGN = CABS(CHAT(I))/EN
   15 PRINT 98, I-1, CONJG(CHAT(I))/EN, MAGN
   98 FORMAT (' ',I5,5X,F13.6,5X,F13.6,5X,F13.6)
C
      STOP
      END
OUTPUT FOR PROGRAM 2

     J        REAL PART        IMAG. PART         MAGNITUDE
     0          .000000           .000000           .000000
     1          .000000          -.500000           .500000
     2          .000000           .000000           .000000
     3          .000000           .000000           .000000
     4          .000000           .000000           .000000
     5          .000000           .250000           .250000
     6          .000000           .000000           .000000
     7          .000000           .000000           .000000
     8         2.000000           .000000          2.000000
     9          .000000           .000000           .000000
    10          .000000           .000000           .000000
    11          .000000           .000000           .000000
    12          .000000           .000000           .000000
    13          .000000           .000000           .000000
    14          .000000           .000000           .000000
    15          .000000           .000000           .000000
    16          .000000           .000000           .000000
    17          .000000           .000000           .000000
    18          .000000           .000000           .000000
    19          .000000           .000000           .000000
    20          .000000           .000000           .000000
    21          .000000           .000000           .000000
    22          .000000           .000000           .000000
    23          .000000           .000000           .000000
    24          .000000           .000000           .000000
    25          .000000          -.250000           .250000
    26          .000000           .000000           .000000
    27          .000000           .000000           .000000
    28          .000000           .000000           .000000
    29          .000000           .000000           .000000
    30          .000000           .000000           .000000
    31          .000000           .000000           .000000
    32          .000000           .000000           .000000
    33          .000000           .000000           .000000
assist the user in selecting the optimum degree of polynomial. The program was run with
the input data shown in Fig. 10.4.

The second program (Fig. 10.11) uses the IMSL subroutine FFTCC to calculate
the $c_j$'s for f(x) = sin (x) + 4 cos (8x) + sin (10x) cos (15x). From Eq. (10.21),
this example will have $a_8 = 4$, $b_1 = 1$, and $-b_5 = b_{25} = \tfrac{1}{2}$, since
sin (10x) cos (15x) = ½(sin (25x) − sin (5x)). By Eq. (10.23) we get $c_1 = -i/2$,
$-c_5 = c_{25} = -i/4$, and $c_8 = 2$. Recall that, for real functions f(x), $c_{-j} = \bar{c}_j$, and
$c_{-j} = c_{N-j}$, j = 1, 2, ..., 31. Since this subroutine calculates $\sum_k f_k e^{+2\pi i jk/N}$, our
input values are the conjugates $(\bar{f}_0, \bar{f}_1, \ldots, \bar{f}_{N-1})$, and on output we again take the
conjugates $(\bar{c}_0, \bar{c}_1, \ldots, \bar{c}_{N-1})$ to get the correct results.

The third program (Fig. 10.12) solves the example at the end of Section 10.2. It 
uses subroutine NLSYST from Chapter 2. 
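The strategy used by NLSYST (Newton's method with partial derivatives estimated by forward differences, followed by a linear solve for the corrections) can be sketched as follows. This is an illustrative version under stated assumptions, not a translation of the listing: the function names, default tolerances, and the small test system are all hypothetical, and the return codes only imitate those documented for NLSYST (1 for XTOL met, 2 for FTOL met, −1 for too many iterations).

```python
def solve_augmented(a):
    """Solve an n-by-(n+1) augmented system by Gaussian elimination with partial pivoting."""
    n = len(a)
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            raise ValueError("matrix nearly singular")
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(i + 1, n):
            ratio = a[r][i] / a[i][i]
            for col in range(i, n + 1):
                a[r][col] -= ratio * a[i][col]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = a[i][n] - sum(a[i][col] * x[col] for col in range(i + 1, n))
        x[i] = s / a[i][i]
    return x


def newton_system(fcn, x, delta=1e-6, xtol=1e-10, ftol=1e-10, maxit=50):
    """Newton iteration with a forward-difference Jacobian; returns (x, code)."""
    x = list(x)
    n = len(x)
    for _ in range(maxit):
        f0 = fcn(x)
        if max(abs(v) for v in f0) <= ftol:
            return x, 2                      # function values small enough
        a = [[0.0] * (n + 1) for _ in range(n)]
        for j in range(n):
            xp = list(x)
            xp[j] += delta                   # perturb one variable at a time
            fp = fcn(xp)
            for i in range(n):
                a[i][j] = (fp[i] - f0[i]) / delta
        for i in range(n):
            a[i][n] = -f0[i]                 # right-hand side is -F
        dx = solve_augmented(a)
        x = [xi + di for xi, di in zip(x, dx)]
        if max(abs(d) for d in dx) <= xtol:
            return x, 1                      # corrections small enough
    return x, -1


def demo(v):
    """Hypothetical test system: x + y = 3, x*y = 2, with a root at (1, 2)."""
    return [v[0] + v[1] - 3.0, v[0] * v[1] - 2.0]


root, code = newton_system(demo, [0.5, 2.5])
```

In Program 3 the role of `demo` is played by subroutine FCN, whose two function values are the partial derivatives of the sum of squares with respect to the two unknown parameters.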


      PROGRAM AERO (INPUT, OUTPUT)
C
C     THIS PROGRAM SOLVES THE NONLINEAR LEAST SQUARES FIT EXAMPLE
C     THAT WAS GIVEN AT THE BEGINNING OF THIS CHAPTER AND DERIVED
C     IN SECTION 10.2.
C
C     PARAMETERS ARE:
C     ROVERC: THE VECTOR CONTAINING X = R/C.
C     VOVERV: THE VECTOR CONTAINING Y = V/V.
C     NEQUS:  THE NUMBER OF UNKNOWNS. HERE THERE ARE TWO.
C     NDATA:  THE NUMBER OF DATA POINTS. HERE TWENTY ONE.
C
C     SUBROUTINES USED:
C     NLSYST: SEE CHAPTER 2.
C     ELIM:   CHAPTER 2.
C
      REAL X(2),F(2),ROVERC(21),VOVERV(21),XTOL,FTOL,DELTA,VALUE
     +     ,TEMP3
      INTEGER NDATA, NEQUS, MAXIT
      EXTERNAL FCN
      COMMON VALUE
      COMMON /BLK1/NDATA, NEQUS, ROVERC, VOVERV
      DATA I,MAXIT/0,50/
      DATA DELTA, XTOL, FTOL, X/0.000125,0.0001,0.0005,0.5,2.0/
C
      CALL NLSYST (FCN, NEQUS, MAXIT, X,F,DELTA, XTOL, FTOL, I)
C
C     WE COMPUTE THE SUM OF THE SQUARE ERROR TERMS
C     AND PRINT OUT ITS VALUE.
C
      PRINT 100, VALUE
  100 FORMAT (/60('*')/T9,'THE SUM OF THE SQUARES ERROR TERM IS:'//
     +        T15,F10.6//)
      STOP
      END


Figure 10.12 Program 3.
C
C     HERE WE COMPUTE THE TWO FIRST PARTIALS THAT MUST BE SET
C     EQUAL TO ZERO.
C
      SUBROUTINE FCN(X,F)
      REAL X(2),F(2),ROVERC(21),VOVERV(21),TEMP1,TEMP2,TEMP3,
     +     SUMF1,SUMF2
      INTEGER NDATA,NEQUS,I
      COMMON VALUE
      COMMON /BLK1/NDATA, NEQUS, ROVERC, VOVERV
      SUMF1 = 0.0
      SUMF2 = 0.0
      VALUE = 0.0
      DO 10 I = 1,NDATA
        TEMP1 = 1.0/EXP(X(2)*ROVERC(I)**2)
        TEMP2 = 1.0 - TEMP1
        TEMP3 = VOVERV(I) - X(1)/ROVERC(I)*TEMP2
        SUMF1 = SUMF1 + TEMP3*TEMP2/ROVERC(I)
        SUMF2 = SUMF2 + X(1)*TEMP3*TEMP1*ROVERC(I)
        VALUE = VALUE + TEMP3**2
   10 CONTINUE
      F(1) = SUMF1
      F(2) = SUMF2
      RETURN
      END
C
C     WE USE BLOCK DATA TO INITIALIZE THE VARIABLES WHICH ALSO
C     HAVE TO BE IN COMMON /BLK1/.
C
      BLOCK DATA
      INTEGER NDATA, NEQUS
      REAL ROVERC(21), VOVERV(21)
      COMMON /BLK1/NDATA, NEQUS, ROVERC, VOVERV
      DATA ROVERC/0.73,0.78,0.81,0.86,0.875,0.89,0.95,1.02,1.03,
     +  1.055,1.135,1.14,1.245,1.32,1.385,1.43,1.445,1.535,1.57,
     +  1.63,1.755/
      DATA VOVERV/0.0788,0.0788,0.064,0.0788,0.0681,0.0703,0.0703,
     +  0.0681,0.0618,0.079,0.0575,0.0681,0.0575,0.0511,0.0575,
     +  0.049,0.0532,0.0511,0.049,0.0532,0.0425/
      DATA NDATA, NEQUS/21,2/
      END
      SUBROUTINE NLSYST(FCN,N,MAXIT,X,F,DELTA,XTOL,FTOL,I)
C
C     SUBROUTINE NLSYST :
C          THIS SUBROUTINE SOLVES A SYSTEM OF N NON-
C     LINEAR EQUATIONS BY NEWTON'S METHOD. THE PARTIAL DERIVATIVES OF
C     THE FUNCTIONS ARE ESTIMATED BY DIFFERENCE QUOTIENTS WHEN A
C     VARIABLE IS PERTURBED BY AN AMOUNT EQUAL TO DELTA ( DELTA IS
C     ADDED ). THIS IS DONE FOR EACH VARIABLE IN EACH FUNCTION.
C     INCREMENTS TO IMPROVE THE ESTIMATES FOR THE X-VALUES ARE COMPU-
C     TED FROM A SYSTEM OF EQUATIONS USING SUBROUTINE ELIM.
C
C     PARAMETERS ARE :
C
C     FCN   - SUBROUTINE THAT COMPUTES VALUES OF THE FUNCTIONS. MUST
C             BE DECLARED EXTERNAL IN THE CALLING PROGRAM.
C     N     - THE NUMBER OF EQUATIONS
C     MAXIT - LIMIT TO THE NUMBER OF ITERATIONS THAT WILL BE USED
C     X     - ARRAY TO HOLD THE X VALUES. INITIALLY THIS ARRAY HOLDS
C             THE INITIAL GUESSES. IT RETURNS THE FINAL VALUES.
C     F     - AN ARRAY THAT HOLDS VALUES OF THE FUNCTIONS
C     DELTA - A SMALL VALUE USED TO PERTURB THE X VALUES SO PARTIAL
C             DERIVATIVES CAN BE COMPUTED BY DIFFERENCE QUOTIENT.
C     XTOL  - TOLERANCE VALUE FOR CHANGE IN X VALUES TO STOP ITERA-
C             TIONS. WHEN THE LARGEST CHANGE IN ANY X MEETS XTOL,
C             THE SUBROUTINE TERMINATES.
C     FTOL  - TOLERANCE VALUE ON F TO TERMINATE. WHEN THE LARGEST F
C             VALUE IS LESS THAN FTOL, SUBROUTINE TERMINATES.
C     I     - RETURNS VALUES TO INDICATE HOW THE ROUTINE TERMINATED
C             I=1   XTOL WAS MET
C             I=2   FTOL WAS MET
C             I=-1  MAXIT EXCEEDED BUT TOLERANCES NOT MET
C             I=-2  VERY SMALL PIVOT ENCOUNTERED IN GAUSSIAN
C                   ELIMINATION STEP - NO RESULTS OBTAINED
C             I=-3  INCORRECT VALUE OF N WAS SUPPLIED - N MUST BE
C                   BETWEEN 2 AND 10
C
      REAL X(N),F(N),DELTA,XTOL,FTOL
      INTEGER N,MAXIT,I
      REAL A(10,11),XSAVE(10),FSAVE(10)
      INTEGER NP,IT,IVBL,ITEST,IFCN,IROW,JCOL


C
C     CHECK VALIDITY OF VALUE OF N
C
      IF ( N .LT. 2 .OR. N .GT. 10 ) THEN
        I = -3
        PRINT 1004, N
        RETURN
      END IF
C
C     BEGIN ITERATIONS - SAVE X VALUES, THEN GET F VALUES
C
      NP = N + 1
      DO 100 IT = 1,MAXIT
        DO 10 IVBL = 1,N
          XSAVE(IVBL) = X(IVBL)
   10   CONTINUE
        CALL FCN(X,F)
C
C     TEST F VALUES AND SAVE THEM
C
        ITEST = 0
        DO 20 IFCN = 1,N
          IF ( ABS(F(IFCN)) .GT. FTOL ) ITEST = ITEST + 1
          FSAVE(IFCN) = F(IFCN)
   20   CONTINUE
        IF ( I .EQ. 0 ) THEN
          PRINT 1000, IT,X
          PRINT 1001, F
        END IF
C
C     SEE IF FTOL IS MET. IF NOT, CONTINUE. IF SO, SET I = 2 AND
C     RETURN.
C
        IF ( ITEST .EQ. 0 ) THEN
          I = 2
          RETURN
        END IF
c 
C     -----------------------------------------------------------
c 
C THIS DOUBLE LOOP COMPUTES THE PARTIAL DERIVATIVES OF EACH FUNCTION 
C FOR EACH VARIABLE AND STORES THEM IN A COEFFICIENT ARRAY. 
C
DO 50 JCOL = 1,N 
X(JCOL) = XSAVE(JCOL) + DELTA 
CALL FCN(X,F) 
DO 40 IROW = 1,N 
A(IROW,JCOL) = (F(IROW) - FSAVE(IROW)) / DELTA 
40 CONTINUE 
c 
C RESET X VALUES FOR NEXT COLUMN OF PARTIALS 
c 
X(JCOL) = XSAVE(JCOL) 
50 CONTINUE 
c 
C     -----------------------------------------------------------
C
C NOW WE PUT NEGATIVE OF F VALUES AS RIGHT HAND SIDES AND CALL ELIM 
c 
DO 60 IROW = 1,N 
A(IROW,NP) = -FSAVE(IROW) 
60 CONTINUE 
CALL ELIM(A,N,NP, 10) 
C
C     -----------------------------------------------------------
c 
C BE SURE THAT THE COEFFICIENT MATRIX IS NOT TOO ILL-CONDITIONED 
c 
DO 70 IROW = 1,N 
IF ( ABS(A(IROW,IROW)) .LE. 1.0E-6 ) THEN 
I= -2 
PRINT 1003 
RETURN 
END IF 
70 CONTINUE 
C
C     -----------------------------------------------------------
c 
C APPLY THE CORRECTIONS TO THE X VALUES, ALSO SEE IF XTOL IS MET. 
c 
ITEST = 0 
DO 80 IVBL = 1,N 
X(IVBL) = XSAVE(IVBL) + A(IVBL,NP) 
IF ( ABS(A(IVBL,NP)) .GT. XTOL ) ITEST = ITEST + 1 
80 CONTINUE 
C
C     IF XTOL IS MET, PRINT LAST VALUES AND RETURN, ELSE DO ANOTHER
C     ITERATION.
C
        IF ( ITEST .EQ. 0 ) THEN
          I = 1
          IF ( I .EQ. 0 ) PRINT 1002, IT,X
          RETURN
        END IF
  100 CONTINUE
C
C     WHEN WE HAVE DONE MAXIT ITERATIONS, SET I = -1 AND RETURN
C
      I = -1
      RETURN
C
 1000 FORMAT(/' AFTER ITERATION NUMBER',I3,' X AND F VALUES ARE'
     +       //10F13.5)
 1001 FORMAT(/10F13.5)
 1002 FORMAT(/' AFTER ITERATION NUMBER',I3,' X VALUES (MEETING',
     +       ' XTOL) ARE '//10F13.5)
 1003 FORMAT(/' CANNOT SOLVE SYSTEM. MATRIX NEARLY SINGULAR.')
 1004 FORMAT(/' NUMBER OF EQUATIONS PASSED TO NLSYST IS INVALID.',
     +       ' MUST BE 1 < N < 11. VALUE WAS ',I3)
      END
      SUBROUTINE ELIM(AB,N,NP,NDIM)
C
C     SUBROUTINE ELIM :
C          THIS SUBROUTINE SOLVES A SET OF LINEAR EQUATIONS
C     AND GIVES AN LU DECOMPOSITION OF THE COEFFICIENT MATRIX. THE
C     GAUSS ELIMINATION METHOD IS USED, WITH PARTIAL PIVOTING.
C     MULTIPLE RIGHT HAND SIDES ARE PERMITTED. THEY SHOULD BE
C     SUPPLIED AS COLUMNS THAT AUGMENT THE COEFFICIENT MATRIX.
C
C     PARAMETERS ARE:
C
C     AB   - COEFFICIENT MATRIX AUGMENTED WITH R.H.S. VECTORS
C     N    - NUMBER OF EQUATIONS
C     NP   - TOTAL NUMBER OF COLUMNS IN THE AUGMENTED MATRIX
C     NDIM - FIRST DIMENSION OF MATRIX AB IN THE CALLING PROGRAM
C
C     THE SOLUTION VECTOR(S) ARE RETURNED IN THE AUGMENTATION COLUMNS
C     OF AB.
C
      REAL AB(NDIM,NP)
      INTEGER N,NP,NDIM
      REAL SAVE,RATIO,VALUE
      INTEGER NM1,IPVT,IP1,J,NVBL,L,KCOL,JCOL,JROW
C
C     BEGIN THE REDUCTION
C     (THE NEXT TWO STATEMENTS ARE MISSING FROM THE LISTING AND ARE
C     RESTORED FROM CONTEXT.)
C
      NM1 = N - 1
      DO 35 I = 1,NM1
C
C     FIND THE ROW NUMBER OF THE PIVOT ROW. WE WILL THEN INTERCHANGE
C     ROWS TO PUT THE PIVOT ELEMENT ON THE DIAGONAL.
C
        IPVT = I
        IP1 = I + 1
        DO 10 J = IP1,N
          IF ( ABS(AB(IPVT,I)) .LT. ABS(AB(J,I)) ) IPVT = J
   10   CONTINUE
C
C     CHECK FOR A NEAR SINGULAR MATRIX
C
        IF ( ABS(AB(IPVT,I)) .LT. 1.0E-6 ) THEN
          PRINT 100
          RETURN
        END IF
C
C     NOW INTERCHANGE, EXCEPT IF THE PIVOT ELEMENT IS ALREADY ON THE
C     DIAGONAL, DON'T NEED TO.
C
        IF ( IPVT .NE. I ) THEN
          DO 20 JCOL = 1,NP
            SAVE = AB(I,JCOL)
            AB(I,JCOL) = AB(IPVT,JCOL)
            AB(IPVT,JCOL) = SAVE
   20     CONTINUE
        END IF
C
C     NOW REDUCE ALL ELEMENTS BELOW THE DIAGONAL IN THE I-TH ROW.
C     CHECK FIRST TO SEE IF A ZERO ALREADY PRESENT. IF SO, CAN SKIP
C     REDUCTION ON THAT ROW.
C
        DO 32 JROW = IP1,N
          IF ( AB(JROW,I) .EQ. 0 ) GO TO 32
          RATIO = AB(JROW,I) / AB(I,I)
          AB(JROW,I) = RATIO
          DO 30 KCOL = IP1,NP
            AB(JROW,KCOL) = AB(JROW,KCOL) - RATIO*AB(I,KCOL)
   30     CONTINUE
   32   CONTINUE
   35 CONTINUE
C
C     WE STILL NEED TO CHECK AB(N,N) FOR SIZE
C
      IF ( ABS(AB(N,N)) .LT. 1.0E-6 ) THEN
        PRINT 100
        RETURN
      END IF
C
C     NOW WE BACK SUBSTITUTE
C
      NP1 = N + 1
      DO 50 KCOL = NP1,NP
        AB(N,KCOL) = AB(N,KCOL) / AB(N,N)
        DO 45 J = 2,N
          NVBL = NP1 - J
          L = NVBL + 1
          VALUE = AB(NVBL,KCOL)
          DO 40 K = L,N
            VALUE = VALUE - AB(NVBL,K) * AB(K,KCOL)
   40     CONTINUE
          AB(NVBL,KCOL) = VALUE / AB(NVBL,NVBL)
   45   CONTINUE
   50 CONTINUE
      RETURN
C
  100 FORMAT(/' SOLUTION NOT FEASIBLE. A NEAR ZERO PIVOT ',
     +       'WAS ENCOUNTERED. ')
      END


OUTPUT FOR PROGRAM 3

AFTER ITERATION NUMBER  1 X AND F VALUES ARE
       .50000      2.00000
     -5.58847      -.40348

AFTER ITERATION NUMBER  2 X AND F VALUES ARE
       .16954      1.32113
      -.78095      -.03873

AFTER ITERATION NUMBER  3 X AND F VALUES ARE
       .11395      1.01639
      -.09659      -.00289

AFTER ITERATION NUMBER  4 X AND F VALUES ARE
       .12644       .58295
       .04134       .00989

AFTER ITERATION NUMBER  5 X AND F VALUES ARE
       .11310       .77840
       .01645       .00448

AFTER ITERATION NUMBER  6 X AND F VALUES ARE
       .08925      1.13874
       .06555       .00379

AFTER ITERATION NUMBER  7 X AND F VALUES ARE
       .08899      1.39420
       .00211       .00098

AFTER ITERATION NUMBER  8 X AND F VALUES ARE
       .07895      1.82196
       .03043       .00071

AFTER ITERATION NUMBER  9 X AND F VALUES ARE
       .07851      2.04292
       .00222       .00015

AFTER ITERATION NUMBER 10 X AND F VALUES ARE
       .07651      2.24423
       .00288       .00005

AFTER ITERATION NUMBER 11 X AND F VALUES ARE
       .07618      2.30574
       .00019       .00000

************************************************************

        THE SUM OF THE SQUARES ERROR TERM IS:

               .000631


EXERCISES 


Section 10.1 

1. Find the individual deviations of the data in Fig. 10.1 from those computed from the
least-squares line, R = 702.2 + 3.395T. Compare these deviations with those from the line drawn
by eye, R = 700 + 3.500T. Find the sum of squares of deviations in each case and compare.
Note that even though the maximum deviations from each of the two lines are not too different,
the sums of squares differ significantly.

> 2. Show that the point whose x-coordinate is the mean of all the x-values and whose y-coordinate
is the mean of all the y-values satisfies the least-squares line. Often a change of variable is
made to relocate the origin at this point, with a corresponding reduction in the magnitude of
the numbers worked with, making them more readily handled by hand or on some desk
calculators.

> 3. Find the least-squares line that fits to the following data, assuming the x-values are free of
error.


(The data are tabulated from y = 2x, with perturbations from a table of random numbers.)


Table 10.8 


Table 10.9 




4. In the data for Exercise 3, consider that the y-values are free of error and that all the errors
are in the x-values. By suitable modifications of the normal equations, now determine the
least-squares line for x = ay + b. Note that this is not the same line as determined in Exercise
3.


5. In multivariate analysis, the least-squares technique is used to determine the hyperplane from
which the sum of the squares of the deviations is a minimum. For z, a function of two
independent variables, x and y, find the normal equations to determine the parameters a, b,
and c in:

z = ax + by + c.


Use this relation to find the least-squares values for the constants if the data below are available.


x    0     1.2    2.1    3.4    4.0    4.2    5.6     5.8            6.9
y    0     0.5    6.0    0.5    3.1    3.2    1.3     7.4            10.2
z    1.2   3.4   -4.6    9.9    2.4    7.2    14.3    [illegible]    1.3


Section 10.2 


6. Observe that the data in Table 10.8 seem to be fit by a curve $y = ae^{bx}$ by plotting on semilog
paper and noting that the points then fall near a straight line. The data are for the solubilities
of n-butane in anhydrous hydrofluoric acid at high pressures, and were needed in the design
of petroleum refineries.


Temperature, °F Solubility, weight % 
71 2.4 
100 3.4 
185 7.0 
239 Tt 
285 19.6 


By plotting on rectilinear graph paper, observe that the relationship is nonlinear. 

7. Determine the constants for $y = ae^{bx}$ for the data in Exercise 6 by the least-squares method
by fitting to the relation ln y = ln a + bx.

8. It is suspected (from theoretical considerations) that the rate of flow from a fire hose is
proportional to some power of the pressure at the nozzle. Determine whether the speculation
seems to be true, and what the exponent is from the data in Table 10.9. (Assume that the
pressure data are more accurate.)




Flow, gallons per minute    Pressure, psi
          94                     10
         118                     16
         147                     25
         180                     40
         230                     60
Table 10.10


9. Since the data of Exercise 8, when plotted on log-log paper (pressure as a function of flow),
seem to have a slope of nearly 2, we should expect that fitting a quadratic to the data would
be successful. Do this and compare the deviations with those from the power-function relation
of Exercise 8.


>10. The data given in Exercise 3, while perturbations from a linear relation, seem to plot better
along a curve because of the accidental occurrence of three negative deviations in succession.
Fit a quadratic to the data.


11. To compare the results of an exact polynomial fit according to Chapter 3 with the least-squares
procedure, find y-values at x = 1.5, 2.5, 3.5, 4.5, and 5.5 for the data of Exercise 3, utilizing
a fifth-degree interpolating polynomial. Sketch the interpolating polynomial and compare to
the least-squares line of Exercise 3, and the least-squares quadratic of Exercise 10. The slope
of the “true” function is 2.0. How do the maximum and minimum values of the slope as
determined from the three approximations (use a graphical procedure) compare to the “true”
value?


>12. The data in Table 10.10 seem to fit a cubic equation, but determine by least squares the
optimum degree of polynomial.


13. Repeat Exercise 12, but now using every other point (x = 0.1, x = 1.6, x = 2.5, and so
on). Do you get the same result for the optimum degree of polynomial? Are the coefficients
the same? Repeat again, with the other half of the points.

14. The data of Exercise 12 also suggest a function of the form y = A + B sin Cx. How could
the least-squares method be used to evaluate A, B, and C? In solving the normal equations,
are any difficulties encountered? What if theory suggested that C = π/10? Would this make
any significant difference in solving the normal equations?


Section 10.3 


15. Compute $T_{11}(x)$ and $T_{12}(x)$.

16. Show that Eq. (10.9) is satisfied for these values of (m, n): (0, 1), (1, 1), (0, 0), (1, 2).

a) Graph 7;(x) on [—1, 1]. 

b) Extend the graphs of 7;(x), Tx(x), T;(x) to the interval [—2, 2]. Observe that the maximum 
magnitude of the Chebyshev polynomials is not equal to one outside of [—1, 1]. 
Section 10.4


18. Reduce Eq. (10.11) to a fourth-degree economized polynomial and show that its maximum
error is 0.000791.


19. The error curve for a truncated Maclaurin-series approximation increases monotonically as x
varies from 0 to the ends of the interval. This is not true for an economized power series.
Exhibit the form of the error curve by plotting the error of Eq. (10.11) on the interval
[0, 1].


»20. Given

\[
\arctan x = x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \frac{x^9}{9} - \cdots,
\]

plot the error over the interval [−1, 1] when the above series is truncated after the term in
$x^7$. Economize the ninth-degree truncated power series three times (giving a third-degree
expression), and plot its error over the interval [−1, 1].


21. Find the first few terms of the Chebyshev series for sin x by rewriting the Maclaurin series
in terms of the $T_i(x)$ and collecting terms. Express this as a power series. Compare the errors
when both series are truncated to third-degree polynomials.

22. A power-series expansion of $(1 + x/5)^{1/2}$ is

\[
\left(1 + \frac{x}{5}\right)^{1/2} = 1 + \frac{x}{10} - \frac{x^2}{200} + \frac{x^3}{2000} - \frac{x^4}{16{,}000} + \frac{7x^5}{800{,}000} - \cdots.
\]


Convert this to a Chebyshev series including terms to 7,(x). What is the maximum error of 
the truncated Chebyshev series? Compare this to the error of the power series truncated after 
the x? term. 


23. The Chebyshev series, and economized polynomials as well, require us to approximate the
function on the interval [−1, 1] only. Show that an appropriate change of variable will change
F(x) on [a, b] to f(y) on [−1, 1]. Find the linear relation between x and y that makes this
transformation.


Section 10.5 


»24. Find Padé approximations to the following functions, with numerators and denominators each
of degree three:

a) sin x; b) cos x; c) $e^x$.

25. Compare the errors on the interval [−1, 1] for the Padé approximations of Exercise 24 with
the errors of the corresponding Maclaurin series.

26. Express these rational functions in continued-fraction form:
A) xe — De + 2 b) ae +x? +243 

xp 2k = 2 Se —4 
c) $\dfrac{2x^4 + 45x^3 + 381x^2 + 1353x + 1511}{x^3 + 21x^2 + 157x + 409}$


In each case, compare the number of multiplication and division operations in continued 
fraction form with initial evaluation of the polynomials by nested multiplication. 


27. Express the Padé approximations of Exercise 24 as continued fractions.


28. Estimate the errors of the Padé approximations of Exercise 24 by computing the coefficient
of the next nonzero term in the numerator. Compare these estimates with the actual errors at
x = 1 and at x = −1.


»29. The Chebyshev series for cos(πx/4) is

\[
0.851632 - 0.146437\,T_2 + 0.00192145\,T_4 - 9.965 \times 10^{-6}\,T_6.
\]

Develop a Padé-like rational function from this by the method of Section 10.5, where 
function is R, >. 

30. Fike (1968) gives this example of a rational fraction approximation to Γ(1 + x) on [0, 1]:

\[
R_{3,4}(x) = \frac{0.999999 + 0.601781x + 0.186145x^2 + 0.0687440x^3}{1 + 1.17899x - 0.122321x^2 - 0.260996x^3 + 0.0609927x^4}.
\]

How could you determine if this is a minimax approximation? Could one find bounds to the
error of the equivalent minimax $R_{3,4}(x)$ if it is not?

Section 10.6 

31. Verify the results of the first example of Section 10.6. 

32. Show that for N = 16 equispaced data points, one can find the ¢,’s by 64 multiplications 
rather than the expected 256. 

33. Find the c,’s for the “sawtooth” function, f(x) = x on [0, π] and f(x) = 2π − x on [π, 2π],
Eq. (10.24).

34. Use FFTCC or some other subroutine to compute the results in Exercise 31. 

35. Show that for V given points, (x,, fi), kK = 0, 1,...,N — land x, = 2ak/N, then é = 
).y- In fact, show that for any multiple of N, say M, & = 6.4. 

36. For a periodic function f(x) to be represented by a Fourier series, the following conditions 
are sometimes given:


a) The function has at most a finite number of discontinuities in any period.
b) The function must contain only a finite number of maxima and minima in any period.
c) The function must be absolutely integrable:

\[
\int_0^L |f(x)|\,dx < \infty, \quad \text{where } L \text{ is the period of the function.}
\]


Which of the following functions satisfy or do not satisfy these conditions? 


at 
D fe) = sin (0, 7] 
0 x=0 
ee: 
ii) roy = {5 eee 
w po {t B92 
1 
37. In many situations, experimental data fit to an exponential relation: $f(x) = ae^{bx}$. To determine
the constants through least squares, one frequently fits the function ln f = ln a + bx to the
data by least squares. In this manner one can find the proper values for the constants b and
ln a quite readily, for the log relation is linear in these terms. However, we might wish to
determine the constants a and b of the original function, without changing it to a logarithm.
Develop a procedure for doing this, which will require solving a set of nonlinear normal
equations. Use the data in Exercise 6 to test your procedure; then compare your values for
a and b with those obtained in Exercise 7.


38. Penrod and Prasanna (1962) measured the total sun energy reaching their test site in Lexington,
Ky., during the winter months. The insolation was recorded both for a vertical receiver and
for one tilted at an angle of 25°45′. Fit the data by least-squares polynomials.


Month Sept. Oct. Nov. Dec. Jan. Feb. Mar. Apr. 
Tritt 2177 1952 1545 1215 1140 1523 1615 1713 
Wen 1484 1523 1290 1041 963 1208 1143 1043 


39. Look up the algorithms used on your computer to compute the functions built into FORTRAN. 
Classify them into Taylor-series formulas, Chebyshev-polynomial formulas, rational func- 
tions, and other types. Which of them are minimax? If more than one computer system is 
accessible to you, compare the different systems. How does BASIC compute these functions? 


40. Write a computer program to get the é, for a discrete Fourier transform by FFT for NV equal 
to a power of 2. 
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Some Basic Information 
from Calculus 


Since a number of results and theorems from the calculus are frequently used in the text, 
we collect here a number of these items for ready reference, and to refresh the student's 
memory. 


OPEN AND CLOSED INTERVALS 


We use for the open interval a < x < b, the notation (a, b), and for the closed interval
a ≤ x ≤ b, the notation [a, b].


CONTINUOUS FUNCTIONS 


If a real-valued function is defined on the interval (a, b), it is said to be continuous at a
point $x_0$ in that interval if for every ε > 0 there exists a positive nonzero number δ such
that $|f(x) - f(x_0)| < \varepsilon$ whenever $|x - x_0| < \delta$ and a < x < b. In simple terms, we can
meet any criterion of matching the value of $f(x_0)$ (the criterion is the quantity ε) by
choosing x near enough to $x_0$, without having to make x equal to $x_0$, when the function
is continuous.

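If a function is continuous for all x-values in an interval, it is said to be continuous on
that interval. A function that is continuous on a closed interval [a, b] will assume a
maximum value and a minimum value at points in the interval (perhaps at the endpoints).
It will also assume any value between the maximum and the minimum at some point in
the interval.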

Similar statements can be made about a function of two or more variables. We then 
refer to a domain in the space of the several variables instead of to an interval. 


Figure A.1 
SUMS OF VALUES OF CONTINUOUS FUNCTIONS


When x is in [a, b], the value of a continuous function f(x) must be no greater than the
maximum and no less than the minimum value of f(x) on [a, b]. The sum of n such
values must be bounded by (n)(m) and (n)(M), where m and M are the minimum and
maximum values. Consequently the sum is n times some intermediate value of the
function. Hence

\[
\sum_{i=1}^{n} f(\xi_i) = n f(\xi) \quad \text{if } a \le \xi_i \le b,\ i = 1, 2, \ldots, n, \quad a \le \xi \le b.
\]

Similarly, it is obvious that

\[
c_1 f(\xi_1) + c_2 f(\xi_2) = (c_1 + c_2) f(\xi), \qquad \xi_1,\ \xi_2,\ \xi \text{ in } [a, b],
\]

for the continuous function f when $c_1$ and $c_2$ are both equal to or greater than one. If the
coefficients are positive fractions, dividing by the smaller gives

\[
c_1 f(\xi_1) + c_2 f(\xi_2) = c_1\left[ f(\xi_1) + \frac{c_2}{c_1} f(\xi_2) \right] = c_1\left(1 + \frac{c_2}{c_1}\right) f(\xi) = (c_1 + c_2) f(\xi),
\]

so the rule holds for fractions as well. If $c_1$ and $c_2$ are of unlike sign, this rule does not
hold unless the values of $f(\xi_1)$ and $f(\xi_2)$ are narrowly restricted.


MEAN-VALUE THEOREM FOR DERIVATIVES 


When f(x) is continuous on the closed interval [a, b], then at some point ξ in the interior
of the interval

\[
f'(\xi) = \frac{f(b) - f(a)}{b - a}, \qquad a < \xi < b,
\]

provided, of course, that f'(x) exists at all interior points. Geometrically this means that
the curve has at one or more interior points a tangent parallel to the secant line connecting
the ends of the curve (Fig. A.1).


MEAN-VALUE THEOREMS FOR INTEGRALS


If f(x) is continuous and integrable on [a, b], then

\[
\int_a^b f(x)\,dx = (b - a) f(\xi), \qquad a \le \xi \le b.
\]

This says, in effect, that the value of the integral is an average value of the function times
the length of the interval. Since the average value lies between the maximum and minimum
values, there is some point ξ at which f(x) assumes this average value.

If f(x) and g(x) are continuous and integrable on [a, b], and if g(x) does not change
sign on [a, b], then

\[
\int_a^b f(x) g(x)\,dx = f(\xi) \int_a^b g(x)\,dx, \qquad a \le \xi \le b.
\]

Note that the previous statement is a special case [g(x) = 1] of this last theorem, which
is called the second theorem of the mean for integrals.


TAYLOR SERIES 
If a function f(x) can be represented by a power series on the interval (−a, a), then the
function has derivatives of all orders on that interval and the power series is

\[
f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \frac{f'''(0)}{3!}x^3 + \cdots.
\]

The preceding power-series expansion of f(x) about the origin is called a Maclaurin series.
Note that if the series exists, it is unique and any method of developing the coefficients
gives this same series.

If the expansion is about the point x = a, we have the Taylor series

\[
f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \cdots.
\]


We frequently represent a function by a polynomial approximation, which we can
regard as a truncated Taylor series. Usually we cannot represent a function exactly by
this means, so we are interested in the error. Taylor's formula with a remainder gives us
the error term. The remainder term is usually derived in elementary calculus texts in the
form of an integral:

\[
f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots + \frac{f^{(n)}(a)}{n!}(x - a)^n
+ \int_a^x \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt.
\]

Since (x − t) does not change sign as t varies from a to x, the second theorem of the
mean allows us to write the remainder term as

\[
\text{Remainder of Taylor series} = \frac{(x - a)^{n+1}}{(n+1)!} f^{(n+1)}(\xi), \qquad \xi \text{ in } [a, x].
\]
The derivative form is the more useful for our purposes. It is occasionally useful to
express a Taylor series in a notation that shows how the function behaves at a distance
h from a fixed point a. If we call x = a + h in the above, so x − a = h, we get

\[
f(a + h) = f(a) + f'(a)h + \frac{f''(a)}{2!}h^2 + \frac{f'''(a)}{3!}h^3 + \cdots + \frac{f^{(n)}(a)}{n!}h^n + \frac{f^{(n+1)}(\xi)}{(n+1)!}h^{n+1}.
\]
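
As an illustration of how the remainder term bounds the truncation error, the short Python sketch below compares the actual error of a truncated Taylor series with the derivative form of the remainder. The choice of f(x) = e^x expanded about a = 0, and the function names `taylor_exp` and `remainder_bounds`, are assumptions made only for this example; they do not come from the text.

```python
import math

def taylor_exp(h, n):
    """Taylor series of e^x about a = 0, summed through the h**n term."""
    return sum(h**k / math.factorial(k) for k in range(n + 1))

def remainder_bounds(h, n):
    """Smallest and largest possible size of the remainder
    f^(n+1)(xi) * h**(n+1) / (n+1)!  for f = exp and xi between 0 and h.
    (Assumes h > 0, so e^xi lies between 1 and e^h.)"""
    scale = h ** (n + 1) / math.factorial(n + 1)
    return scale * 1.0, scale * math.exp(h)

h = 0.5
for n in range(1, 6):
    actual = math.exp(h) - taylor_exp(h, n)
    low, high = remainder_bounds(h, n)
    print(n, low, actual, high)
```

For each n the actual error should fall between the two bounds, since the remainder equals the bound expression evaluated at some ξ in [0, h].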


TAYLOR SERIES FOR FUNCTIONS OF TWO VARIABLES 


For a function of two variables, f(x, y), the rate of change of the function can be due to 
changes in either x or y. The derivatives of f can be expressed in terms of the partial 
derivatives. For the expansion in the neighborhood of the point (a, b), 


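\[
\begin{aligned}
f(x, y) = {} & f(a, b) + f_x(a, b)(x - a) + f_y(a, b)(y - b) \\
& + \frac{1}{2!}\left[ f_{xx}(a, b)(x - a)^2 + 2 f_{xy}(a, b)(x - a)(y - b) + f_{yy}(a, b)(y - b)^2 \right] + \cdots.
\end{aligned}
\]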


DESCARTES’ RULE OF SIGNS 
Let p(x) be a polynomial with real coefficients and consider the equation p(x) = 0. Then 


1. The number of positive real roots is equal to the number of variations in the
signs of the coefficients of p(x) or is less than that number by an even integer.


2. The number of negative real roots is determined the same way, but for p(−x).
Here also the number of negative roots is equal to the number of variations in
the signs of the coefficients of p(−x) or is less than that number by an even
integer.


For example, the polynomial equation $p(x) = 3x^5 - 2x^4 + 7x^2 - 12 = 0$ will have 3 or 1
positive and 2 or 0 negative real roots.
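
The counting in the rule of signs is easy to mechanize. The following Python sketch (the function names `sign_variations` and `descartes_counts` are hypothetical, chosen for this illustration) counts sign variations for p(x) and p(−x) and lists the possible numbers of positive and negative real roots for the example polynomial above.

```python
def sign_variations(coeffs):
    """Count sign changes in a coefficient sequence, ignoring zero coefficients."""
    signs = [c > 0 for c in coeffs if c != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

def descartes_counts(coeffs):
    """coeffs are listed from the highest power down to the constant term.
    Returns (possible counts of positive roots, possible counts of negative roots)."""
    degree = len(coeffs) - 1
    # p(-x): the coefficient of every odd power changes sign
    reflected = [c if (degree - i) % 2 == 0 else -c for i, c in enumerate(coeffs)]
    v_pos = sign_variations(coeffs)
    v_neg = sign_variations(reflected)
    # Each count may be reduced by any even integer
    return list(range(v_pos, -1, -2)), list(range(v_neg, -1, -2))

# p(x) = 3x^5 - 2x^4 + 7x^2 - 12 (zero coefficients for x^3 and x)
positive, negative = descartes_counts([3, -2, 0, 7, 0, -12])
print(positive, negative)
```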
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Deriving Formulas by the 
Method of Undetermined 
Coefficients 


In Chapter 3 we developed formulas for integration and differentiation of functions by 
replacing the actual function with a polynomial that agrees at a number of points (a so- 
called interpolating polynomial), and then integrating or differentiating the polynomial. 
These formulas are valuable if one wishes to write computer programs for integration 
and differentiation, but the most important use is for solving differential equations numer- 
ically. Because computers calculate their functional values rather than interpolating in a 
table, there is less interest today in interpolating polynomials than in earlier times. Hence, 
there is reason to present an alternative method of deriving the formulas for derivatives 
and integrals that are needed to solve differential equations.

We will call the method that we employ in this appendix the method of undetermined 
coefficients. Basically, we impose certain conditions on a formula of desired form and 
use these conditions to determine values of the unknown coefficients in the formula. 
Hamming (1962) presents the method in considerable detail. 


B.1 DERIVATIVE FORMULAS BY THE METHOD OF
UNDETERMINED COEFFICIENTS


Since the derivative of a function is the rate of change of the function relative to changes 
in the independent variable, we should expect that formulas for the derivative would 
involve differences between function values in the neighborhood of the point where we 
wish to evaluate the derivative. It is, in fact, possible to approximate the derivative as a 
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linear combination of such function values. While one can argue that formulas of greatest 
accuracy will use function values very near to the point in question, the formulas important 
in practice impose the restriction that only function values at equally spaced x-values are 
to be used. 

For example, we can write a formula for the first derivative in terms of n + 1
equispaced points:

\[
f'(x_0) = c_0 f(x_0) + c_1 f(x_1) + c_2 f(x_2) + \cdots + c_n f(x_n), \qquad x_{i+1} - x_i = h = \text{constant}. \tag{B.1}
\]


The more terms we employ, the greater accuracy we shall expect, since more infor- 
mation about the function is being fed in. We will evaluate the coefficients in the equation 
by requiring that the formula be exact whenever the function is a polynomial of degree 
n or less.* (We will find, throughout this chapter, that the method of undetermined 
coefficients uses this criterion to develop formulas. It has validity because any function 
that is continuous on an interval can be approximated to any specified precision by a 
polynomial of sufficiently high degree. Using polynomials to replace the function also 
greatly simplifies the work, in contrast to replacing with other functions.) 

Let us illustrate the method of undetermined coefficients with a simple case. We shall
simplify the notation by defining

\[
f_i = f(x_i).
\]

If we write the derivative in terms of only two function values, we would have

\[
f_0' = c_0 f_0 + c_1 f_1. \tag{B.2}
\]

We require that the formula be exact if f(x) is a polynomial of degree 1 or less. (The
maximum degree is always one less than the number of undetermined constants.) Hence,
since the formula is to be exact if f(x) is any first-degree polynomial, it must be exact
if either f(x) = x or f(x) = 1, for these definitions of f(x) are just two special cases of
the general first-degree function ax + b. We write Eq. (B.2) for both these cases:

\[
\begin{aligned}
f(x) &= x, & f'(x_0) &= 1: & 1 &= c_0(x_0) + c_1(x_0 + h), \\
f(x) &= 1, & f'(x_0) &= 0: & 0 &= c_0(1) + c_1(1).
\end{aligned} \tag{B.3}
\]

We solve Eqs. (B.3) simultaneously to get $c_0 = -1/h$, $c_1 = 1/h$. Consequently,

\[
f_0' \doteq \frac{f_1 - f_0}{h}. \tag{B.4}
\]


The dot over the equal sign in Eq. (B.4) is to remind us that the formula is only an 
approximation. It is exact if f(x) is a polynomial of degree 1, but not exact if a polynomial 
of higher degree, or some transcendental function. 


*Intuitively, it seems reasonable that we can satisfy this criterion, for a polynomial of degree n is determined
uniquely by its n + 1 coefficients and our formula contains n + 1 constants.
Similarly, a three-term formula can be derived by replacing the function with $x^2$, x, and 1 in

\[
f_0' = c_0 f_0 + c_1 f_1 + c_2 f_2.
\]

The set of equations to solve is

\[
\begin{aligned}
2x_0 &= c_0(x_0)^2 + c_1(x_0 + h)^2 + c_2(x_0 + 2h)^2, \\
1 &= c_0(x_0) + c_1(x_0 + h) + c_2(x_0 + 2h), \\
0 &= c_0(1) + c_1(1) + c_2(1).
\end{aligned} \tag{B.5}
\]

The arithmetic is simplified by letting $x_0 = 0$. That this is valid is readily seen. Imagine
the graph of f(x) versus x. The derivative we desire is the slope of the curve at the point
where $x = x_0$, which obviously is unchanged by a translation of the axes. Taking $x_0 = 0$
is the equivalent of a translation of axes so that the origin corresponds to $x_0$. Equations
(B.5) become, with this change,

\[
\begin{aligned}
0 &= c_0(0) + c_1(h)^2 + c_2(2h)^2, \\
1 &= c_0(0) + c_1(h) + c_2(2h), \\
0 &= c_0(1) + c_1(1) + c_2(1).
\end{aligned}
\]

Solving, we get $c_0 = -3/(2h)$, $c_1 = 2/h$, $c_2 = -1/(2h)$; so a three-term formula for the
derivative is

\[
f_0' \doteq \frac{-3f_0 + 4f_1 - f_2}{2h}. \tag{B.6}
\]

We could extend these formulas to include more and more terms, but after a little reflection
we can conclude that the original form, Eq. (B.1), is not the best to use. It utilizes only
functional values to one side of the point in question, while it would be better to utilize
information from both sides of $x_0$. After all, the limit in the definition of the derivative
is two-sided, and further, we utilize closer and hence more pertinent information by going
to both the left and the right.
We expect an improvement over Eq. (B.6) if we begin with

\[
f_0' = c_{-1} f_{-1} + c_0 f_0 + c_1 f_1, \tag{B.7}
\]

where $f_{-1} = f(x_0 - h)$. Adopting the simplification of letting $x_0 = 0$ as before, and
taking $x^2$, x, and 1 for f(x), we have to solve

\[
\begin{aligned}
0 &= c_{-1}(-h)^2 + c_0(0) + c_1(h)^2, \\
1 &= c_{-1}(-h) + c_0(0) + c_1(h), \\
0 &= c_{-1}(1) + c_0(1) + c_1(1).
\end{aligned} \tag{B.8}
\]

Completing the algebra, we get $c_{-1} = -1/(2h)$, $c_0 = 0$, $c_1 = 1/(2h)$.
The following equation is particularly important:

\[
f_0' \doteq \frac{f_1 - f_{-1}}{2h}. \tag{B.9}
\]
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In the next section we will compare its accuracy with Eqs. (B.4) and (B.6). Because the 
point where the derivative is evaluated is centered among the function values whose 
differences appear in the formula, it is called a central-difference approximation. Higher- 
order central difference approximations to the first derivative, utilizing five or seven or 
more points, can be derived by this same procedure. 

We now apply the method of undetermined coefficients to higher derivatives. We 
will discuss only the central difference approximations since they are the more widely 
used. In terms of values both to the right and left of xo, 


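\[
f_0'' = c_{-1} f_{-1} + c_0 f_0 + c_1 f_1.
\]

Taking $f(x) = x^2$, x, and 1, we get the relations ($x_0 = 0$),

\[
\begin{aligned}
2 &= c_{-1}(-h)^2 + c_0(0) + c_1(h)^2, \\
0 &= c_{-1}(-h) + c_0(0) + c_1(h), \\
0 &= c_{-1}(1) + c_0(1) + c_1(1).
\end{aligned}
\]

The resulting formula is

\[
f_0'' \doteq \frac{f_1 - 2f_0 + f_{-1}}{h^2}. \tag{B.10}
\]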


Equation (B.10), like Eq. (B.9), is particularly useful. 
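
The linear systems solved above all have the same structure, so they can be set up and solved mechanically. The Python sketch below is one way to do this under stated assumptions: it takes $x_0 = 0$ and h = 1 (so the returned weights must be divided by h raised to the order of the derivative), uses exact rational arithmetic, and the names `solve` and `derivative_weights` are hypothetical helpers, not routines from the text.

```python
from fractions import Fraction
from math import factorial

def solve(A, b):
    """Solve the square system A c = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]

def derivative_weights(offsets, order):
    """Weights w_j such that
        f^(order)(x0) ~ sum_j w_j * f(x0 + offsets[j]*h) / h**order,
    exact whenever f is a polynomial of degree less than len(offsets)."""
    n = len(offsets)
    # Row k requires the formula to be exact for f(x) = x**k
    A = [[Fraction(s) ** k for s in offsets] for k in range(n)]
    # The order-th derivative of x**k at 0 is k! if k == order, otherwise 0
    b = [factorial(k) if k == order else 0 for k in range(n)]
    return solve(A, b)

print(derivative_weights([0, 1], 1))       # coefficients of Eq. (B.4)
print(derivative_weights([0, 1, 2], 1))    # coefficients of Eq. (B.6)
print(derivative_weights([-1, 0, 1], 1))   # coefficients of Eq. (B.9)
print(derivative_weights([-1, 0, 1], 2))   # coefficients of Eq. (B.10)
```

Passing five symmetric offsets, for instance `[-2, -1, 0, 1, 2]`, with `order` 3 or 4 gives five-term central formulas for the higher derivatives in the same way.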

We are not confined to using only functional values in the method of undetermined 
coefficients.* Suppose we have values of the first derivative as well as functional values. 
A formula for the second derivative might then be written as 


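\[
f_0'' = c_0 f_0 + c_1 f_1 + c_3 f_0'.
\]

As before, we take $x^2$, x, and 1 for f(x), with $x_0 = 0$:

\[
\begin{aligned}
2 &= c_0(0)^2 + c_1(h)^2 + c_3(0), \\
0 &= c_0(0) + c_1(h) + c_3(1), \\
0 &= c_0(1) + c_1(1) + c_3(0).
\end{aligned}
\]

Solving, we obtain $c_0 = -2/h^2$, $c_1 = 2/h^2$, $c_3 = -2/h$, so that

\[
f_0'' \doteq \frac{2(f_1 - f_0 - h f_0')}{h^2}.
\]
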

In the same way, we can derive formulas for the third and fourth derivatives. Derivatives
beyond these do not often appear in applied problems. We must use a minimum
of four and five terms, however, since only polynomials of degree 3 and 4 have nonzero
third or fourth derivatives. For the third-degree formula, complete symmetry with four
points is impossible; five-term formulas for both derivatives are therefore given. We
present the results only, leaving the derivations as an exercise:

\[
f_0''' \doteq \frac{f_2 - 2f_1 + 2f_{-1} - f_{-2}}{2h^3}, \tag{B.11}
\]

\[
f_0^{iv} \doteq \frac{f_2 - 4f_1 + 6f_0 - 4f_{-1} + f_{-2}}{h^4}. \tag{B.12}
\]


*We must be sure that the set of values is sufficient to determine a polynomial uniquely, however. For example,
$f_{-1}$, $f_1$, and $f_0'$ will not give a formula for $f_0''$.


B.2 ERROR TERMS FOR DERIVATIVE FORMULAS


In the previous section, we used the method of undetermined coefficients to derive several
formulas for the first derivative of a function, utilizing function values at equispaced
x-values. These were, using the notation that $f_i = f(x_i)$,

\[
f_0' \doteq \frac{f_1 - f_0}{h}, \tag{B.13}
\]
\[
f_0' \doteq \frac{-3f_0 + 4f_1 - f_2}{2h}, \tag{B.14}
\]
\[
f_0' \doteq \frac{f_1 - f_{-1}}{2h}. \tag{B.15}
\]

While each of these formulas is not exact, as suggested by the ≐ symbol, we argued
heuristically that the error should decrease in each succeeding one. We now wish to
develop expressions for the errors.

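Begin with the Taylor-series expansion of $f(x_1) = f(x_0 + h)$ in terms of
$x_1 - x_0 = h$:

\[
f(x_1) = f(x_0) + h f'(x_0) + \frac{h^2}{2} f''(x_0) + \frac{h^3}{6} f'''(x_0) + \cdots.
\]

Changing to our subscript notation, and truncating after the term in h, with the usual
error term, we have

\[
f_1 = f_0 + h f_0' + \frac{h^2}{2} f''(\xi), \qquad x_0 < \xi < x_0 + h. \tag{B.16}
\]

Solving for $f_0'$, we have

\[
f_0' = \frac{f_1 - f_0}{h} - \frac{h}{2} f''(\xi), \qquad x_0 < \xi < x_0 + h, \tag{B.17}
\]

so that the error term is $-\tfrac{1}{2} h f''(\xi)$ for the derivative formula, Eq. (B.13).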

In Eq. (B.17), the error term involves the second derivative of the function evaluated
at a place which is known only within a certain interval. In the majority of applications
for numerical differentiation, not only is the point of evaluation uncertain, but the function
f(x) is also unknown. If we do not know f(x), we can hardly expect to know its derivatives.
We do know, however, that the error involves the first power of $h = x_{i+1} - x_i$, and, in
fact, the only way we can change the error is to change h. Making h smaller will decrease
the error, and in the limit as h → 0 the error will go to zero. Further, as h → 0, $x_1 \to x_0$, and
the value of ξ is squeezed into a smaller and smaller interval. In other words, $f''(\xi)$
approaches a fixed value, specifically $f''(x_0)$, as h goes to zero.

The special importance of h in the error term is denoted by a special notation in
numerical analysis, the order relation. We say the error of Eq. (B.17) is “of order h”
and write

\[
\text{Error} = O(h) \quad \text{when} \quad \lim_{h \to 0} (\text{error}) = ch,
\]

where c is a fixed value not equal to zero.

To develop the error term for Eq. (B.14), we proceed similarly, except that we write
expansions for both $f_1$ and $f_2$. It is also necessary to carry terms through $h^3$ because the
second derivatives cancel, resulting in an error term involving the third derivative:

\[
\begin{aligned}
f_1 &= f_0 + h f_0' + \tfrac{1}{2} h^2 f_0'' + \tfrac{1}{6} h^3 f'''(\xi_1), & x_0 < \xi_1 < x_0 + h, \\
f_2 &= f_0 + 2h f_0' + 2h^2 f_0'' + \tfrac{8}{6} h^3 f'''(\xi_2), & x_0 < \xi_2 < x_0 + 2h.
\end{aligned}
\]

Note that the values of $\xi_1$ and $\xi_2$ may not be identical. If we multiply the first equation
by 4, the second by −1, and add $-3f_0$ to their sum, we get

\[
-3f_0 + 4f_1 - f_2 = (-3 + 4 - 1) f_0 + (4 - 2) h f_0' + (2 - 2) h^2 f_0'' + \frac{4h^3}{6} f'''(\xi_1) - \frac{8h^3}{6} f'''(\xi_2).
\]

Solving for $f_0'$, we get

\[
f_0' = \frac{-3f_0 + 4f_1 - f_2}{2h} + \frac{h^2}{3}\left[ 2f'''(\xi_2) - f'''(\xi_1) \right],
\qquad x_0 < \xi_1 < x_0 + h, \quad x_0 < \xi_2 < x_0 + 2h. \tag{B.18}
\]

The last term of Eq. (B.18) is the error term. As h → 0, the two values of ξ approach
the same value. Consequently the error term approaches $\tfrac{1}{3} h^2 f'''(x_0)$. We then conclude
that the error of Eq. (B.14) is $O(h^2)$.

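For the error of Eq. (B.15), we proceed similarly. We leave the development as
an exercise to show that

\[
\begin{aligned}
f_0' &= \frac{f_1 - f_{-1}}{2h} - \frac{1}{6} h^2 f'''(\xi), \qquad x_0 - h < \xi < x_0 + h, \\
&= \frac{f_1 - f_{-1}}{2h} + O(h^2).
\end{aligned} \tag{B.19}
\]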


The error terms of both Eqs. (B.18) and (B.19) are $O(h^2)$, but the coefficient in Eq.
(B.19) is only half the magnitude of the coefficient in Eq. (B.18). The progressive
increase in accuracy we anticipated is confirmed.

By similar arguments, we can show that the formulas for third and fourth derivatives,
Eqs. (B.11) and (B.12), both have errors $O(h^2)$.
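
These order relations can be observed numerically. The Python sketch below applies the three first-derivative formulas, Eqs. (B.13) through (B.15), to f(x) = e^x at x = 0 (a test function chosen only for this illustration, since all of its derivatives equal 1 there) and prints the absolute errors as h is halved. The function names are hypothetical.

```python
import math

def two_point(f, x, h):
    """Eq. (B.13): error O(h)."""
    return (f(x + h) - f(x)) / h

def three_point(f, x, h):
    """Eq. (B.14): one-sided formula, error O(h^2)."""
    return (-3 * f(x) + 4 * f(x + h) - f(x + 2 * h)) / (2 * h)

def central(f, x, h):
    """Eq. (B.15): central difference, error O(h^2)."""
    return (f(x + h) - f(x - h)) / (2 * h)

exact = 1.0  # derivative of e^x at x = 0
for h in (0.1, 0.05, 0.025):
    errors = [abs(rule(math.exp, 0.0, h) - exact)
              for rule in (two_point, three_point, central)]
    print(h, errors)
```

Halving h should roughly halve the error of the two-point formula and quarter the errors of the other two, and the central-difference error should be about half that of the three-point one-sided formula, in line with the coefficients in Eqs. (B.18) and (B.19).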


Figure B.1


B.3 INTEGRATION FORMULAS BY THE METHOD OF
UNDETERMINED COEFFICIENTS


A numerical integration formula will estimate the value of the integral of the function
by a formula involving the values of the function at a number of points in or near the
interval of integration. Figure B.1 illustrates the general situation. If we desire to evaluate
$\int_a^b f(x)\,dx$, it is obvious that if we could find some average value of the function on the
interval [a, b], the integral would be:

\[
\int_a^b f(x)\,dx = (b - a) f_{\text{av}}.
\]


It seems reasonable to assume that $f_{\text{av}}$ could be approximated by a linear combination of
function values within or near the interval.* We therefore write, similarly to our procedure
for derivatives,

\[
\int_a^b f(x)\,dx = c_0 f(x_0) + c_1 f(x_1) + \cdots + c_n f(x_n),
\]

where the coefficients $c_i$ are to be determined. The points $x_i$ at which the function is
to be evaluated can also be left undetermined but, for the formulas which we need to
derive, we will again impose the restriction that they are equally spaced with the value
of $x_{i+1} - x_i = \Delta x = h$ = constant. We will impose the further restriction that the
boundaries of the interval of integration, a and b, coincide with two of the $x_i$-values.


We start with a simple case. Suppose we wished to express the integral in terms of
just two functional values, specifically f(a) and f(b). Our formula takes the form

\[
\int_a^b f(x)\,dx = c_0 f(a) + c_1 f(b). \tag{B.20}
\]

Obviously, the values of $c_0$ and $c_1$ depend on f(x), and probably on the values of a and
b as well. The method of undetermined coefficients assumes that f(x) can be approximated
by a polynomial over the interval [a, b] and determines $c_0$ and $c_1$ so that Eq. (B.20) is
exact for all polynomials of a certain maximum degree or less.

Equation (B.20) has only two arbitrary constants, so we would not expect that it 
could be made exact for polynomials of degree higher than the first, since more than two 
parameters appear in second- or higher-degree polynomials. We therefore force Eq. (B.20) 


*One could generalize the concept by extending this to include derivatives of the function as well, but we stay 
with the simpler case. 
to be exact when f(x) is replaced by either a first-degree polynomial or one of zero degree
(a constant function). If it is to be exact for all first-degree polynomials, it must be exact
for the very simple function f(x) = x. If it is to be exact when the function is any constant,
it must hold if f(x) = 1. We express these two relations:

\[
\begin{aligned}
\int_a^b x\,dx &= c_0(a) + c_1(b), \\
\int_a^b (1)\,dx &= c_0(1) + c_1(1).
\end{aligned} \tag{B.21}
\]

The integrals of Eqs. (B.21) are easily evaluated, so we have, as conditions that the
constants must satisfy,

\[
\begin{aligned}
\frac{b^2}{2} - \frac{a^2}{2} &= c_0 a + c_1 b, \\
b - a &= c_0 + c_1.
\end{aligned}
\]

Solving these equations gives $c_0 = (b - a)/2$, $c_1 = (b - a)/2$.
Consequently,

\[
\int_a^b f(x)\,dx \doteq \frac{b - a}{2}\,[f(a) + f(b)]. \tag{B.22}
\]


This formula is the trapezoidal rule, and was derived in Chapter 3 as Eq. (3.10) through 
an entirely different approach. We have put a dot over the equality sign in Eq. (B.22) to 
remind ourselves that the relation is only approximately true, for the function f(x) cannot 
ordinarily be replaced without error by a first-degree polynomial. It is intuitively obvious 
that unless the interval [a, b] is very small, the error will be considerable. 

We can reduce the error in the above formula by using a higher-degree polynomial
to replace f(x), but we would then need to take a linear combination of more than two
function values. In fact, we will have to preserve a balance between the number of
undetermined coefficients in the formula and the number of parameters in the polynomial.
The number of coefficients hence must be one greater than the degree of the polynomial.

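We now look at the next case, a three-term formula corresponding to replacing f(x)
by a quadratic. For three terms, using $x_0 = a$ and $x_2 = b$, with $x_1$ at the midpoint,

\[
\int_a^b f(x)\,dx = c_0 f(a) + c_1 f\!\left(\frac{a + b}{2}\right) + c_2 f(b).
\]

The formula must be exact if $f(x) = x^2$, or f(x) = x, or f(x) = 1, so

\[
\begin{aligned}
\int_a^b x^2\,dx &= \frac{b^3}{3} - \frac{a^3}{3} = c_0 a^2 + c_1\left(\frac{a + b}{2}\right)^2 + c_2 b^2, \\
\int_a^b x\,dx &= \frac{b^2}{2} - \frac{a^2}{2} = c_0 a + c_1\left(\frac{a + b}{2}\right) + c_2 b, \\
\int_a^b dx &= b - a = c_0 + c_1 + c_2.
\end{aligned}
\]
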
The solution is $c_0 = c_2 = \tfrac{1}{6}(b - a)$, $c_1 = \tfrac{4}{6}(b - a)$. Hence

\[
\int_a^b f(x)\,dx \doteq \frac{b - a}{6}\left[ f(a) + 4f\!\left(\frac{a + b}{2}\right) + f(b) \right]. \tag{B.23}
\]

A more common form of Eq. (B.23) is found by writing b − a = 2h (the interval from
a to b is subdivided into two panels) and substituting $x_0$ for a, $x_2$ for b, and $x_1$ for the
midpoint. We then have

\[
\int_{x_0}^{x_0 + 2h} f(x)\,dx \doteq \frac{h}{3}\,(f_0 + 4f_1 + f_2). \tag{B.24}
\]

Formula (B.24) is Simpson's 1/3 rule, a particularly popular formula. Again, it is not
exact, as indicated by the dot over the equality sign.
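
The conditions used to derive Eqs. (B.22) and (B.24) again form small linear systems, one row per monomial 1, x, $x^2$, and so on. The Python sketch below solves them exactly for any set of distinct nodes; the helper names `solve` and `integration_weights` are hypothetical, and the interval [0, 2] with h = 1 is chosen only to keep the printed weights simple.

```python
from fractions import Fraction

def solve(A, b):
    """Solve the square system A c = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]

def integration_weights(nodes, a, b):
    """Weights c_j such that the integral of f over [a, b] is approximated by
    sum_j c_j * f(nodes[j]), exact for 1, x, ..., x**(len(nodes) - 1)."""
    a, b = Fraction(a), Fraction(b)
    n = len(nodes)
    # Row k requires exactness for f(x) = x**k
    A = [[Fraction(x) ** k for x in nodes] for k in range(n)]
    # Exact integral of x**k over [a, b]
    rhs = [(b ** (k + 1) - a ** (k + 1)) / (k + 1) for k in range(n)]
    return solve(A, rhs)

print(integration_weights([0, 2], 0, 2))      # trapezoidal rule on [0, 2]
print(integration_weights([0, 1, 2], 0, 2))   # Simpson's 1/3 rule with h = 1
```

The nodes need not lie inside [a, b], so the same helper can be tried on formulas that use function values outside the interval of integration.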

The application of Eqs. (B.22) and (B.24) over an extended interval is straightfor- 
ward. It is intuitively apparent that the error in these formulas will be large if the interval 
is not small. (We discuss these errors quantitatively in Section B.5.) To apply these to a 
large interval of integration and still maintain control over the error, we break the interval 
into a large number of small subintervals and add together the formulas applied to the 
subintervals. When this is done we get the extended trapezoidal rule: 


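\[
\int_a^b f(x)\,dx \doteq \frac{b-a}{2n}\left(f(x_0) + 2f(x_1) + 2f(x_2) + \cdots + 2f(x_{n-1}) + f(x_n)\right); \tag{B.25}
\]

and the extended Simpson's 1/3 rule: 

\[
\int_a^b f(x)\,dx \doteq \frac{b-a}{6n}\left(f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + \cdots + 2f(x_{2n-2}) + 4f(x_{2n-1}) + f(x_{2n})\right). \tag{B.26}
\]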


These forms of the trapezoid rule (Eq. (B.25)) and of Simpson's 1/3 rule (Eq. (B.26)) are 
widely used in computer programs for integration. By applying them with n increasing, 
the error can be made arbitrarily small.* For Simpson’s rule, observe that the number of 
subintervals must be even. 
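The extended rules can be sketched in a few lines of Python. This is only an illustrative sketch, not one of the book's programs: the function names `trapezoid` and `simpson` are hypothetical, and here `n` denotes the total number of equal subintervals (so the `2n` of Eq. (B.26) corresponds to `n` in `simpson`, which must therefore be even).

```python
import math


def trapezoid(f, a, b, n):
    """Extended trapezoidal rule, Eq. (B.25), using n equal subintervals."""
    if n < 1:
        raise ValueError("n must be at least 1")
    h = (b - a) / n
    # Interior points carry weight 2, the two endpoints weight 1.
    interior = sum(f(a + i * h) for i in range(1, n))
    return h / 2 * (f(a) + 2 * interior + f(b))


def simpson(f, a, b, n):
    """Extended Simpson's 1/3 rule, Eq. (B.26), using n (even) subintervals."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be even and at least 2")
    h = (b - a) / n
    # Odd-numbered points carry weight 4, even-numbered interior points weight 2.
    odd = sum(f(a + i * h) for i in range(1, n, 2))
    even = sum(f(a + i * h) for i in range(2, n, 2))
    return h / 3 * (f(a) + 4 * odd + 2 * even + f(b))


if __name__ == "__main__":
    exact = math.e - 1.0  # integral of exp(x) over [0, 1]
    for n in (2, 4, 8, 16):
        err_t = abs(trapezoid(math.exp, 0.0, 1.0, n) - exact)
        err_s = abs(simpson(math.exp, 0.0, 1.0, n) - exact)
        print(n, err_t, err_s)
```

Running the script prints the absolute errors of both rules for an increasing number of subintervals, which makes the remark about shrinking the error by increasing n easy to observe.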


INTEGRATION FORMULAS USING POINTS OUTSIDE 
THE INTERVAL 


In studying numerical methods to solve differential equations we shall have need for some 
specific integral formulas that involve function values computed at points outside the 
interval of integration. Figures B.2, B.3, and B.4 sketch three special cases of importance. 




*Except for round-off error effects, which eventually will dominate since they are not decreased by small 
subdivision of the interval and, in fact, may increase as the number of computations increases. 


Figure B.2 

Figure B.3 

Figure B.4 


In Fig. B.2 the curve that passes through the four points whose abscissas are $x_{n-3}$, $x_{n-2}$, 
$x_{n-1}$, and $x_n$ is extrapolated to $x_{n+1}$, and we desire the integral only over the extrapolated 
interval. Figure B.3 presents the case where the curve passes through four points, and 
the integral is taken only over the last panel. In Fig. B.4 we consider a case where a 
curve that fits at three points is extrapolated in both directions, and integration is taken 
over four panels. 

In the derivations of this section it will be convenient to adopt the notation that $f_n = 
f(x_n)$, so that subscripts on the function indicate the x-value at which it is evaluated. 

For the first case (Fig. B.2) we desire a formula of the form 


\[
\int_{x_n}^{x_{n+1}} f(x)\,dx = c_0 f_{n-3} + c_1 f_{n-2} + c_2 f_{n-1} + c_3 f_n.
\]

With four constants, we can make the formula exact when f(x) is any polynomial of 
degree 3 or less. Accordingly, we replace f(x) successively by $x^3$, $x^2$, $x$, and 1 to evaluate 
the coefficients. 

It is apparent that the formula must be independent of the actual x-values. To simplify 
the equations, let us shift the origin to the point $x = x_n$; our integral is then taken over 
the interval from 0 to h, where $h = x_{n+1} - x_n$: 

\[
\int_0^h f(x)\,dx = c_0 f(-3h) + c_1 f(-2h) + c_2 f(-h) + c_3 f(0).
\]


Carrying out the computations by replacing f(x) with the particular polynomials, we have 


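\[
\begin{aligned}
\frac{h^4}{4} &= c_0(-3h)^3 + c_1(-2h)^3 + c_2(-h)^3 + c_3(0), \\
\frac{h^3}{3} &= c_0(-3h)^2 + c_1(-2h)^2 + c_2(-h)^2 + c_3(0), \\
\frac{h^2}{2} &= c_0(-3h) + c_1(-2h) + c_2(-h) + c_3(0), \\
h &= c_0(1) + c_1(1) + c_2(1) + c_3(1).
\end{aligned} \tag{B.27}
\]

After completion of the algebra we get 

\[
\int_{x_n}^{x_{n+1}} f(x)\,dx \doteq \frac{h}{24}\left(-9f_{n-3} + 37f_{n-2} - 59f_{n-1} + 55f_n\right). \tag{B.28}
\]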


For the second case, illustrated by Fig. B.3, we again translate the origin to $x_n$ to 
simplify. The set of equations analogous to Eqs. (B.27) is 

\[
\begin{aligned}
\frac{h^4}{4} &= c_0(-2h)^3 + c_1(-h)^3 + c_2(0) + c_3(h)^3, \\
\frac{h^3}{3} &= c_0(-2h)^2 + c_1(-h)^2 + c_2(0) + c_3(h)^2, \\
\frac{h^2}{2} &= c_0(-2h) + c_1(-h) + c_2(0) + c_3(h), \\
h &= c_0(1) + c_1(1) + c_2(1) + c_3(1).
\end{aligned}
\]

These give values of the constants in the integration formula: 

\[
\int_{x_n}^{x_{n+1}} f(x)\,dx \doteq \frac{h}{24}\left(f_{n-2} - 5f_{n-1} + 19f_n + 9f_{n+1}\right). \tag{B.29}
\]


The third case, Fig. B.4, where extrapolation of a quadratic in both directions is 
involved, leads to the equation 


\[
\int_{x_{n-3}}^{x_{n+1}} f(x)\,dx \doteq \frac{4h}{3}\left(2f_{n-2} - f_{n-1} + 2f_n\right). \tag{B.30}
\]

Setting up the equations for this case is left as an exercise for the student. 
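The method of undetermined coefficients used for all three cases can be sketched in Python with exact rational arithmetic. This is a minimal illustration under stated assumptions, not the book's own program: the helper name `quadrature_weights` is hypothetical, the origin is shifted to $x_n$, and h is taken as 1 (the weights scale linearly with h). Row k of the system requires the formula to be exact for $f(x) = x^k$.

```python
from fractions import Fraction


def quadrature_weights(nodes, a, b):
    """Return weights c_i such that sum(c_i * f(nodes[i])) equals the integral
    of f over [a, b] for every polynomial of degree less than len(nodes)."""
    m = len(nodes)
    nodes = [Fraction(x) for x in nodes]
    a, b = Fraction(a), Fraction(b)
    # Augmented system: sum_i c_i * x_i**k = integral of x**k over [a, b].
    rows = [
        [x ** k for x in nodes] + [(b ** (k + 1) - a ** (k + 1)) / (k + 1)]
        for k in range(m)
    ]
    # Gauss-Jordan elimination; exact because every entry is a Fraction.
    for col in range(m):
        pivot = next(r for r in range(col, m) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        p = rows[col][col]
        rows[col] = [v / p for v in rows[col]]
        for r in range(m):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [vr - factor * vc for vr, vc in zip(rows[r], rows[col])]
    return [rows[i][m] for i in range(m)]


if __name__ == "__main__":
    # Fig. B.2: points at -3h, -2h, -h, 0; integrate over [0, h].
    print(quadrature_weights([-3, -2, -1, 0], 0, 1))
    # Fig. B.3: points at -2h, -h, 0, h; integrate over [0, h].
    print(quadrature_weights([-2, -1, 0, 1], 0, 1))
    # Fig. B.4: points at -2h, -h, 0; integrate over [-3h, h].
    print(quadrature_weights([-2, -1, 0], -3, 1))
```

Exact fractions are used instead of floating point so that the computed weights can be compared directly with the rational coefficients of the integration formulas.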
The particular methods for solving differential equations that use the formulas derived 
in this section are known as Adams’ method and Milne’s method. 


B.5 ERROR OF INTEGRATION FORMULAS 


In each of the formulas that we derived above, we used the symbol ≐ to remind us that 
the formulas are not exact unless f(x) is in fact a polynomial of a certain degree or less. 
It is important to derive expressions for the error. 

We first determine the error term for the trapezoidal rule over one panel, Eq. (B.22). 
Begin with a Taylor expansion: 

\[
F(x_1) = F(x_0) + F'(x_0)h + \frac{1}{2}F''(x_0)h^2 + \frac{1}{6}F'''(\xi)h^3, \qquad x_0 < \xi < x_0 + h. \tag{B.31}
\]

Let us define $F(x) = \int_a^x f(t)\,dt$ so that $F'(x) = f(x)$, $F''(x) = f'(x)$, $F'''(x) = f''(x)$. 
Equation (B.31) becomes, with this definition of F(x), 

\[
\int_a^{x_1} f(x)\,dx = \int_a^{x_0} f(x)\,dx + f(x_0)h + \frac{1}{2}f'(x_0)h^2 + \frac{1}{6}f''(\xi)h^3.
\]

On rearranging, and using subscript notation, we get 

\[
\int_a^{x_1} f(x)\,dx - \int_a^{x_0} f(x)\,dx = \int_{x_0}^{x_1} f(x)\,dx = f_0 h + \frac{1}{2}f_0' h^2 + \frac{1}{6}f''(\xi)h^3. \tag{B.32}
\]


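We now replace the term $f_0'$ by the forward-difference approximation with error term 
from Section B.2: 

\[
f_0' = \frac{f_1 - f_0}{h} - \frac{1}{2}hf''(\xi_1), \qquad x_0 < \xi_1 < x_1.
\]

Equation (B.32) becomes 

\[
\int_{x_0}^{x_1} f(x)\,dx = f_0 h + \frac{1}{2}h(f_1 - f_0) - \frac{1}{4}h^3 f''(\xi_1) + \frac{1}{6}h^3 f''(\xi).
\]

We can combine the last two terms into $-\frac{1}{12}h^3 f''(\xi)$, where $\xi$ also lies in $[x_0, x_1]$. 
Hence, 

\[
\int_{x_0}^{x_1} f(x)\,dx = \frac{h}{2}(f_0 + f_1) - \frac{1}{12}h^3 f''(\xi).
\]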


The error of the trapezoidal rule over one panel is then $O(h^3)$. This is called the local 
error. Since we normally use a succession of applications of this rule over the interval 
[a, b] to evaluate $\int_a^b f(x)\,dx$, by subdividing into n = (b - a)/h panels, we need to 
determine the so-called global error for such intervals of integration. The global error 
will be the sum of the local errors in each of the n panels: 

\[
\begin{aligned}
\text{Global error} &= -\frac{1}{12}h^3 f''(\xi_1) - \frac{1}{12}h^3 f''(\xi_2) - \cdots - \frac{1}{12}h^3 f''(\xi_n) \\
&= -\frac{1}{12}h^3\left[f''(\xi_1) + f''(\xi_2) + \cdots + f''(\xi_n)\right].
\end{aligned}
\]

Here the subscripts on $\xi$ indicate the panel in whose interval the value lies. If $f''(x)$ is 
continuous throughout the interval of integration, 

\[
\sum_{i=1}^{n} f''(\xi_i) = n f''(\xi), \qquad a < \xi < b.
\]

Using n = (b - a)/h, we get 

\[
\text{Global error} = -\frac{1}{12}h^3\left(\frac{b-a}{h}\right)f''(\xi) = -\frac{b-a}{12}h^2 f''(\xi) = O(h^2).
\]

Therefore the global error is $O(h^2)$, even though the local error is $O(h^3)$. 
Similar treatment shows that the local error for Simpson’s 1/3 rule is 

\[
-\frac{1}{90}h^5 f^{(4)}(\xi) = O(h^5),
\]

and the global error is 

\[
-\frac{1}{180}(b-a)h^4 f^{(4)}(\xi) = O(h^4).
\]

One can also show that the local error of each of the integration formulas in Section B.4 
is $O(h^5)$. 
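These orders can be checked numerically. The sketch below is an illustration only, assuming the test integrand exp(x) and hypothetical helper names. It estimates the exponent p in error ≈ C·h^p by halving h and taking the base-2 logarithm of the error ratio, once for the composite (global) rules and once for a single trapezoidal panel (local error).

```python
import math


def trapezoid(f, a, b, n):
    """Extended trapezoidal rule with n equal subintervals."""
    h = (b - a) / n
    return h / 2 * (f(a) + 2 * sum(f(a + i * h) for i in range(1, n)) + f(b))


def simpson(f, a, b, n):
    """Extended Simpson's 1/3 rule with n (even) equal subintervals."""
    h = (b - a) / n
    odd = sum(f(a + i * h) for i in range(1, n, 2))
    even = sum(f(a + i * h) for i in range(2, n, 2))
    return h / 3 * (f(a) + 4 * odd + 2 * even + f(b))


def global_order(rule, f, a, b, exact, n):
    """Observed order of the global error when n is doubled (h is halved)."""
    e1 = abs(rule(f, a, b, n) - exact)
    e2 = abs(rule(f, a, b, 2 * n) - exact)
    return math.log2(e1 / e2)


def local_order(f, F, x0, h):
    """Observed order of the one-panel trapezoidal error when h is halved.
    F is an antiderivative of f, used to get the exact panel integral."""
    def err(step):
        return abs(trapezoid(f, x0, x0 + step, 1) - (F(x0 + step) - F(x0)))
    return math.log2(err(h) / err(h / 2))


if __name__ == "__main__":
    exact = math.e - 1.0
    print("trapezoid, global:", global_order(trapezoid, math.exp, 0.0, 1.0, exact, 8))
    print("Simpson, global:  ", global_order(simpson, math.exp, 0.0, 1.0, exact, 8))
    print("trapezoid, local: ", local_order(math.exp, math.exp, 0.0, 0.1))
```

For a smooth integrand the printed values should lie close to 2, 4, and 3 respectively, in line with the global and local orders derived in this section; for very small h round-off would eventually spoil the estimate.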
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Software Libraries 


Besides the programs and subroutines presented in this book, there are many other 
excellent programs and subroutines. Among these are: 


IMSL (INTERNATIONAL MATHEMATICAL AND 
STATISTICAL LIBRARY) 


This is a library of well over 400 subroutines available commercially and perhaps the 
most widely used for scientific programming. These subroutines cover both mathematical 
and statistical areas. 

The library is divided into the following chapters, each of which consists of many 
subprograms. 

A  Analysis of Variance. 
B  Basic Statistics. 
C  Categorized Data Analysis. 
D  Differential Equations, Quadrature, and Differentiation. 
E  Eigensystem Analysis. 
F  Forecasting, Econometrics, Time Series, and Transforms. 
G  Generation and Testing of Random Numbers. 
I  Interpolation, Approximation, and Smoothing. 
L  Linear Algebraic Equations. 
M  Mathematical and Statistical Functions. 
N  Non-Parametric Statistics. 
O  Observation Structure and Multivariate Statistics. 
R  Regression Analysis. 
S  Sampling. 
U  Utility Functions. 
V  Vector and Matrix Arithmetic. 
Z  Zeroes and Extrema, Linear Programming. 

In order to use any IMSL subroutine, one must write a FORTRAN main program 
that calls the appropriate IMSL subroutine. More recently, IMSL, Inc. has introduced 
the PROTRAN system that uses the IMSL subroutines and allows access to them in a 
simple, easy language. Both IMSL and PROTRAN are products of IMSL, Inc. 

Source: IMSL, 2500 ParkWest Tower One, 2500 CityWest Blvd., Houston, TX 
77042. 


TWODEPEP 


This is a package of programs devoted to solving two-dimensional elliptic and parabolic 
problems in partial-differential equations by the finite-element method. This work is also 
a product of IMSL, Inc. 


LINPACK 


This is a collection of subroutines for solving linear systems of equations. The listing of 
the programs is published in a book by Dongarra et al. (1979). The programs themselves 
are distributed by IMSL, Inc. 


ELLPACK 


The ELLPACK system solves elliptic problems in two and three dimensions by a large 
number of methods. ELLPACK allows the user to state the problem in simple, mathe- 
matical language. ELLPACK consists of over 50,000 lines of FORTRAN code, but using 
a FORTRAN preprocessor allows for easy interaction with the large list of programs in 
the ELLPACK system. 

Source: Professor John R. Rice, Math Science 428, Purdue University, West Lafay- 
ette, IN 47907. 


PDECOL 


This is a collection of subroutines for solving parabolic and hyperbolic partial-differential 
equations. The program is listed as ACM Algorithm 540 and is distributed through IMSL, Inc. 


ACM ALGORITHMS 


ACM (The Association for Computing Machinery) provides a listing of the algorithms 
submitted to its journal, Transactions on Mathematical Software. The titles of many of 
them are listed in the back of this journal. For a listing and tapes one can write to: ACM 
Algorithms Distribution Service, IMSL, Inc., and so on. 

For a detailed study of these and other software items, one should consult Rice (1983). 




SOFTWARE FOR PERSONAL COMPUTERS 


In the past few years many of the good software libraries have been ported to personal 
computers, most commonly PC/XT/ATs. In addition, there are two new ones that were 
written especially for the microcomputer. Only a few of the many excellent ones are 
given here. The first three presented here are well known to scientific programmers. 


IMSL 


IMSL has subdivided its subroutines into three distinct libraries: MATH, STAT, and 
SFUN. This last stands for special functions. These libraries can be leased separately or 
together. The programs are in FORTRAN with the same excellent documentation IMSL 
provides for its original mainframe version. 

Source: IMSL, 2500 ParkWest Tower One, 2500 CityWest Blvd., Houston, TX 
77042. 


NAG 


The Numerical Algorithms Group provides a library of 688 FORTRAN subprograms for 
mainframes. However, the NAG Fortran PC50 Library contains 50 of the most commonly 
used programs that have been implemented for personal computers. 

Source: Numerical Algorithms Group, Inc., 1101 31st St., Suite 100, Downers 
Grove, IL 60515-1263. 


THE SCIENTIFIC DESK 


This library contains more than 350 routines that cover both mathematics and statistics. 
The library also includes FORTRAN-callable graphics, plotting, and cursor control sub- 
routines. It includes a set of interactive problem solvers that require no programming. 
The tutorials, problem solvers, and documentation can be accessed through a menu-driven 
interface. This package is also available for the mainframe. The work was begun by the 
founder and first president of IMSL. 

Source: C. Abaci, 208 St. Mary’s St., Raleigh, NC 27605. 


BORLAND 


Turbo PASCAL Numerical Methods Toolbox contains a large selection of numerical 
programs written in Turbo PASCAL. 
Source: Borland International, 4585 Scotts Valley Dr., Scotts Valley, CA 95066. 


NUMERICAL RECIPES 


The FORTRAN subroutines and the PASCAL procedures as listed in the book Numerical 
Recipes: The Art of Scientific Computing are available on separate disks. 

Source: Customer Services Department, Cambridge University Press, Edinburgh 
Building, Shaftesbury Rd., Cambridge CB2 2RU, England. 


Answers to Selected Exercises 


CHAPTER 1 


40. 


Successive iterates are 0.5, 0.75, 0.625, 0.5625, 0.59375, 0.60938. For 5 accurate digits, 
the error must be ≤ 0.00005, which requires 15 iterations. The error bound is then 
$\pm 0.5^{15} = \pm 0.00003$. 


b) 0.4450 c) 0.9210. 
0.55496, 2.2470, -0.80194 
—0.458962 


Roots are 0 + i, ~2 + 4/. For the particular computer programs used, the execution times 

were 0.13 sec for the Bairstow method and 0.49 sec for the Muller method 

Ten iterations. Using Aitken acceleration, 4 iterations are required (2 extrapolations). 

Suitable rearrangements are 

,[—4x? + 2 

ei mae 

Some of these may converge slowly. 

Iterates: 1.0, 1.5, 1.75, 1.875, 1.9375, . . . , 2.0. Errors: 1.0, 0.5, 0.25, 0.125, 0.0625, 
, 0.0. This is linearly convergent; Aitken acceleration applies. 


Roots found are 0.791288 and 1.61803, leaving a quadratic whose roots are —0.618034 and 
—3.79129 


$P(x) = (x + 1)(x^3 + 3.6x^2 + 3x - 14) = (x + 1)(x - 1.4)(x^2 + 5x + 10)$ 


After three iterations r = 4.1, s = -5.2; factor is $x^2 - 4.1x + 5.2$. From the reduced 
polynomial, the other factor is $x^2 + x + 1$. Zeros of the polynomial are at 2.05 ± 0.9987i, 
-0.5 ± 0.8660i. 

a) 1.8019, -1.2470, 0.44504. 

d) Two e-values oscillate, indicating two pairs of complex roots. Factors are $x^2 - 4.1x + 
5.2$ and $x^2 + x + 1$. 




CHAPTER 2 


62. 


14. 


20. 


va 8 
32. 


b) Chopped: 0.174. 10°, error 0.6 x 10°. 
Rounded: 0.177 x 10°, error —0.4 x 10°. 


c) 0.135 x 10-*. Absolute error —0.24 x 107-7, relative error —0.18 x 1072. 

b) (1.32 * 1.98) * 1.01 = 1.32 * (1.98 * 1.01) with rounding. 

The results depend on the computer system. With IBM/360, single precision: 

a) 0.9999991 b) 0.9999878 c) 0.9995956 

d) 14.017 after about 1,050,000 terms. It converges because the computer evaluates 1/n as 
0, for large enough n, in comparison to the current sum. 

e) The value is greater than the result in (d) because less precision is lost due to exponent 
alignment. 

Absolute min at x = 0 and 1. A relative max (f’(x) = 0) at x = 0.9499. 


3 -17 
18], By=| 76], xy =19. 


38. 55. 

= =T 2 S Ad ot 
-4 4 10), CA=1|9 1 (4). 
=3 2 45 1 2-1 


2, x2 = (—13 + 4)/-3 = 3, x; = (7 + 8 — 9)/-6 = -1. 
1.30, —1.35,,—0.275). 
b) x7 = (1.45, —1.59, —0.276). 
c) Calculated right-hand sides are 


(0.02, 1.02, —0.21) and = (0.04, 1.03, —0.54). 


x7 = (0.5834, 0.6370, 0.4019, —0.6439), 
x4 = (—0.8333, 0.3636, —0.6515, 0.8939), 
xf = (2.419, —0.4401, 2.699, 0.007576). 


Az = b implies 


(B + Ci)(x + yi) = (Bx — Cy) + (By + Cx)i 


=p+qi. 

a) |B “sli =|? 

Cc BLY. a) 
b) 2n? + 2n versus 4n? + 2n. 
b) x7 = (27.051, 8.2051, 5.7692, 14.872, 53.718). 
b) 7c, — 12c, — 3c; = 0. 
a) det(H) is very small (about 1.65 x 107); one also cannot avoid a very small divisor in 

solving the system. 

b) x7 = (1.11, 0.228, 1.95, 0.797). 


c) x7 = (0.988, 1.42, —0.428, 2.10). 
d) x7 = (1.0000, 0.9995, 1.0017, 0.9990). 
det = 51. 
16.00 —120.0 239-59" —133:9 
His —120.0 1199. —2699. 1679. 


239.9 —2699. 6477. —4198. 
—139.9 1679. —4198. 2799. 
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35. 
36. 
38. 


41. 


45. 


25. 
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10.337. 
a) x” = (1592.71, —631.956, —493.653). 


x! = (119.5, —47.14, —36.84). This is further evidence of ill-condition, in that small changes 
in the coefficients make large changes in the solution vector. 


. rll 
1.82 1075, C.N.? = 55,642. 
\lblly 
25.5, dels — | 04, 
Ill, 


The relation is verified no matter whether ||X|| or |{.x{| is used. 
Using 3-digit arithmetic to compute @: 
@ = {—0.00271, 0.00126, 0.00103}, ¥ = {0.153, 0.144, —0.166}. 


(% + @) = {0.15029, 0.14526, —0.16497}, to be compared with x = {0.15094, 0.14525, 
—0.16592}. The improvement is remarkable even in one iteration. 


After six iterations, the Jacobi vector has diverged to x7 = (3.16, 3.16) while with Gauss— 
Seidel, x7 = (~154.5, 467.6). Gauss-Seidel is diverging much more rapidly. 


x7 = (1, —1, 2). 
(0.72595, 0.50295), (— 1.6701, 0.34513). 
(1.64304, —2.34978), (—2.07929, —3.16173). 


x — 1x + 2x — 2 (x — 1)(x + 2)(x + 1 

= an easy AS = Me + D(9) = -0.5x3 + ax? + 0.x - 4. 
—0.00033 min, 

22 =- estimates = 

1.2218. Actual error 0.0004, estimates eer aah 

a) 1.0919 b) 1.0973 c) 1.0941 d) 1.0951 e) 1.0920 

1 
[Q)™ saan and f(0.15) = 1.0956. 


sin(x + 1) 
A sixth-degree polynomial is required to fit exactly. Since the third differences are almost 
constant (in fact, their variation could be due to round-off errors in f(x)), a third-degree 
polynomial will almost fit all seven points 
(0.158) = 0.78801, f(0.636) = 0.65178. 


All results are identical (f(0.385) = 0.74091) because all polynomials are the same though 
of different forms. 


Formula Jo £ y(0.92) 
NGF 0.148 2.1 0.3836 
NGB 0.518 -0.9 0.3836 
GF 0.248 Lt 0.3836 
GB 0.370 0.1 0.3836 
Bessel 0.309 Ld 0.3836 


(The precise value of each y(0.92) was 0.383564.) 
(s)h2¢"(E), x9 < € <x, (based on NGF). 
Set the derivative of [s(s - 1)]/2 equal to zero; this gives a maximum at s = 0.5. 
32. a) $\Delta(f_0 g_0) = f_1 g_1 - f_0 g_0 = f_1 g_1 - f_0 g_1 + f_0 g_1 - f_0 g_0$ 
$= f_0(g_1 - g_0) + g_1(f_1 - f_0) = f_0 \Delta g_0 + g_1 \Delta f_0$. 
b) Ax” = n! by the argument in Section 3.4. 
VAR" = (1 EOE” = (E = 1x" = x? = nn! 
Vex" = (E'Ay'x” = ME" x" = X(E“ xy? = A(x — 1)" = n! 
because (x — 1)" is an nth-degree polynomial whose leading term is x". 
34. Any, = (WE )E"y, = Vyy4n so r=st+n, 
38. x(25) = 0.6. Estimated error = 0.1613 (min) to 504 (max). 


42. [4.9523 —2.7323 5, —0.46275 
0.13 1.1568 0.18 S,| _ | —0.04359 
0.18 1.6800 0,66 /S;} | 0.21212)" 

0.4412 2.6788 ]LS,, 0.50507. 


So = 5.69235, — 4.69235,, 

Ss = 1.57585, — 0.57576S3. 
57. (1.6, 0.33) = 1.8330, Compare to the true value of 1.8350. 
58. f(1.62, 0.31) = 1.7524. (Analytical value = 1.7515.) 


CHAPTER 4 

1. f'(0.15): a) 2.715, error 0.180 b) 2.862, error 0.033 c) 2.878, error 0.018. 
f'(0.19): a) 2.170, error 0.116 b) 2.268, error 0.018 c) 2.278, error 0.008. 
f'(0.23): a) 1.810, error 0.078 b) 1.878, error 0.011 c) 1.881, error 0.007. 
2. x = 0.15: a) 0.150 to 0.193 b) 0.017 to 0.034 c) 0.006 to 0.010. 
x = 0.19: a) 0.098 to 0.120 b) 0.010 to 0.017 c) 0.001 to 0.004. 
x = 0.23: a) 0.069 to 0.082 b) 0.006 to 0.010 c) $7 \times 10^{-4}$ to 0.002. 


(Round-off errors cause the actual error to fall outside these bounds in some cases.) 
9. Use A-"A’f as an approximation of f”(). The best average value for the derivative uses a 
difference in the center of the range of interpolation. Estimates of errors are 


x = 0.15: a) 0.148 b) 0.015 c) —0.002; 
x = 0.19: a) 0.111 b) 0.009 c) —0.002; 
x = 0.23: a) 0.075 b) 0.006 c) 0. 


Round-off may invalidate such estimates. 


17. Successive computations and extrapolations are 


Ax = 0.08: 1.9706 


1.8865 
Ax = 0.04: 1.9075 1.8874 
1.8875 
Ax = 0.02: 1.8925 
21. a) h: 0.0001 0.001 0.01 0.1 
A i 60 3.50 3.45 3.52 
Error: cd —0.074 —0.024 —0.094 
b) h: 0.0001 0.001 0.01 0.1 
fe 5.00 3.50 3.40 3.52 


Error: —1.574 —0.074 0.026 —0.094 


30. 
33. 
36. 


40. 


65. 


69. 




The error terms are of one order of h greater than expected because 

\[
\int_0^{n-1} \binom{s}{n}\,ds = 0 \quad \text{when } n \text{ is odd.}
\]

This can be shown by a change of variable so that the range of integration is symmetrical 
about the origin; the integrand changes to an odd function for which the integral is zero. For 
example, with n = 3, take s = t + 1, giving 

\[
\int_0^2 \binom{s}{3}\,ds = \int_{-1}^{1} \frac{(t+1)(t)(t-1)}{6}\,dt = \int_{-1}^{1} \frac{t^3 - t}{6}\,dt = 0.
\]


a) 1.7683 b) 1.7728 c) 1.7904. 
h = 1.1 X 10-3, using the upper bound of the error estimate. 


h      Integral    Extrapolations 
1.0    0.32045 
                   0.34156 
0.5    0.33628                0.34154 
                   0.34136 
0.25   0.34009 

Using the upper bound to the error, h must be ≤ 0.135. With h = 0.1, the integral is 1.718283 
(exact value is 1.718282). 
—0.52042 
a 9 95 g 3 
| fix) dx = AGBfy + 3a + 7% fo + fo = Piped Peay, 


0.94608 
a) 2.8285. 
0.785353 + 0.000045 = 0.785398. 
ae 15 2.0 2.5 
—0.505 0.234 -0. ar Spline, 
0.966 0.118 0.168 condition 1. 
*-0.476 —0.240 ss Spline. 
0.761 0.181 0.122 condition 2. 
—0.462 —0.245 -0.161 Spline, 
0.664 0.203 0.132 condition 3. 
y —0.500 —0.267 —0.167 Central 
As 0.644 0.268 0.132 differences. 
—0.4444 —0,250 Breen Analytical 
0.5926 0.2500 0.1280 values. 


The natural spline gives poor results, in comparison to the other techniques. 


Integral = 1.29919. Analytical value = 1.30176. Simpson's 1/3 rule = 1.30160. In this 
example, Simpson's rule has a considerably smaller error. 


b]l 42 4 1 
4 16 8 16 4 
2 8 4 8 2 
4 16 8 16 4 
bee #1 
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13. 


18, 


20. 


33. 


35. 


a) 0.140586 __ b) 0.140587. 
a) 0.1046. 


Integrating with x constant, Ax = 0.2, and taking four intervals in the y-di 
Integral = 0.64307. 


ye) = ae ot ae + xt Bes 

y(0.1) = 1.11589, y(0.5) = 2.0245. 

y(0.2) = 1.2015, y(0.4) = 1.4130, y(0.6) = 1.6487. 
With h = 0.01, y(0.1) = 1.11418, error = 0.00171. To reduce the error to 0.00005 
(34-fold), we must reduce h 34-fold, or to about 0.00029. 


y(0.1) = 1.11587 (four decimals are correct). The simple Euler method will require about 
340 steps and 340 function evaluations for similar accuracy, compared to four steps and eight 
function evaluations with the modified Euler method. 

a Q 0.2 0.3 0.4 0.5 

y: 2.2150 2.4630 2.7473 3.0715 3.4394 

t 

y 


0.2 0.4 0.6 
2.0933 2.1755 2.2493 


x      y           Analytical 
0.0   -1.000000   -1.000000 
0.2   -0.781269   -0.781269 
0.4   -0.529680   -0.529680 
0.6   -0.251188   -0.251188 
0.8    0.049329    0.049329 
1.0    0.367880    0.367880 

a) y(0.5) = —0.28326, error = —0.00077. 

b) y(0.5) = —0.28387, error = —0.00016. 

c) y(0.5) = -0.28396, error = -0.00007. 

y(4) predicted = 4.2229, y(4) corrected = 4.1149. From these, the error in y, is estimated 
as 0.0037; equivalently y, has three correct digits. (The actual error is 0.0082.) Obviously 
the starting values must have at least three digits of accuracy also. 


x 0.8 1.0 12 1.6 2.0 
y: 2.3163 2.3780 2.4350 2.5380 2.6294 
Error estimate: 0.0003 <0.00005 =0 —0.00005 —0.00002 


After x = 1.2, h was increased to 0.4. 

By Runge-Kutta: 

x: 0 0.2 0.4 0.6 

y: 0 0.0004 0.0064 0.0325 

By Adams-Moulton: 

x: 0.8 0.9 1.0 1.1 1.2 1.25 1.3 1.35 1.4 

y: 0.1035 0.1669 0.2574 0.3836 0.5581 0.6688 0.7994 0.9541 1.1387 
The step size was halved after x = 0.8 and x = 1.2, to provide accuracy to four decimals. 
a) df/dy = sin x so hy, = (24/9)/1 = 2.67. 

b) With h = 0.267, D cannot exceed 10 x 10~% for N-decimal-place accuracy. 

c) If D = 14.2 x 10-%, A cannot exceed (1/14.2)hmax- 
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Est’d. Actual 
x y f 1+ hf, phy” error error 
1.0 1.000 1.000 1.200 0.015 0 0 
iy 1.100 1.331 1,242 0.022 0.019 0.017 
1.2 1,233 1.825 1,296 0.035 0.052 0.049 
i 1.416 2.605 1.368 0.058 0.115 O11 
14 1.676 3.933 1,437 0.087 0.234 0.247 
1.5 2.069 6.424 1.560 0.173 0.481 0,597 
1.6 2.712 11.766 1.782 0.403 1.091 1.834 


In the above calculations, average values were used to propagate the errors. 


y =z y(0) = 0; 
Zz E 2(0) = 0. 
ia 0 0.2 04 0.6 
y 1 0,982 0,933 0.864 
ae 0 0.022 0,093 0.220 
x y y 
0 1 | 
0.1 0.8950 —1.0995 
0.2 0.7802 —1.1956 
0.3 0.6561 — 1.2847 
0.4 0.5236 1.3629 
0.5 0.3840 —1.4263 
0.6 0.2389 —1,4715 
t x Analytical 
0 0 0 
0.1 0.0715 0.0707 
0.2 0.1983 0.1999 
0.3 0.2004 0.2026 
0.4 —0.0237 —0.0233 
0.5 —0.3747 —0.3784 
0.6 —0.5912 0.5977 
0.7 —0.4377 —0.4419 
0.8 0.0898 0.0924 
With y'(0) = 1.0, y(1) = 2.08536. 
With y'(0) = 0.8, y(1) = 1.81534. 


Interpolating gives y'(0) = 0.86271 for y(1) = 1.9; 
Computations with y'(0) = 0.86271: 


t:            0      0.25     0.50     0.75     1.0 
y:            0      0.2388   0.5724   1.0948   1.9000 
Analytical:   0      0.2406   0.5750   1.0969   1.9000 
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31. 


35. 
For the inverse, λ = -0.10060, 0.14221, -0.24441, which are reciprocals of the above values. 


CHAPTERT 3. 


Using the modified Euler method with h = 0.2, initial slope is —2.4270. 


x: 0 0.2 0.4 0.6 0.8 1.0 
y: 1 0.4494 -0.0119 -0.4449 -0.7863 -1.0000 
c) θ: 0 π/8 π/4 3π/8 π/2 

y: 0 0.3851 0.7108 0.9269 1.0 
c) θ: 0 π/8 π/4 3π/8 π/2 

y: -1.0199 -0.9413 0.7175 -0.3830 0.0105 
Using h = 0.2: 
Sy 10.2 0.4 0.6 0.8 
y: 0.2607 (0.7844 1.3592 1.8109 

3 

Analytical: y = = ; Approximating: y = 


3 


b) Change variable. Let u = y — 1 — 2x; so u” = 2x, u(0) = 0, u(1) = O gives y = 
u+1+ 2x. 


a) +2.82843 b) +3.08459 
Analytical: y = Ae~* sin 27x 
From finite differences, using h = } and k = +6,3623: 


c) +3.17664 d) 3.29271, 3.29499. 


es, | 10.25) 0.50 0.75 
y: 0.7788 —0.0607 —0.4664 
Analytical: 0.7788 0 —0.4724 


b) Let Xp = [i] ax iblpeiMiess 


Then p, = 9, wp = 4, wy = A= 6; 


s-[l}s- [ths FAB) 


e) λ = 3.49086, $x^T$ = (1, 0.9362, 0.7733). 
λ = -9.9404, 7.0319, -4.0915. 


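\[
\frac{\partial u}{\partial x} = \frac{u_{i+1,j} - u_{i-1,j}}{2\Delta x},
\]
\[
\frac{\partial}{\partial y}\left(\frac{\partial u}{\partial x}\right) = \frac{(u_{i+1,j+1} - u_{i-1,j+1})/2\Delta x - (u_{i+1,j-1} - u_{i-1,j-1})/2\Delta x}{2\Delta y}
= \frac{1}{4h^2}\left(u_{i+1,j+1} - u_{i-1,j+1} - u_{i+1,j-1} + u_{i-1,j-1}\right) \quad \text{when } \Delta x = \Delta y = h.
\]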
Temperatures are 
0.69 2.08 5.56 14.58 38.19 
0.69 2.08 5.56 14.58 38.19 
The interior temperatures of the plate are 
45.000 90.000 135.000 180.000 
30.000 60.000 90.000 120.000 
15.000 30.000 45.000 60.000 

Starting at the upper left-hand corner, going left to right and down, we get the following 12 
values for u: 


44.9 89.3 132.1 168.8 
30.5 59.7 88.2 117.3 
18.2 30.4 44.6 60.2 


For points 1 to 8: 
74.46, 27.69, 47.86, 65.38, 78.85, 12.90, 23.89, 34.82. 


Same answers as Exercise 7. Beginning with initial values of 0° at all interior points, 18 
iterations are required to achieve a maximum change at any point less than 0.0001°. 

Same answers as Exercise 7. Beginning with all interior points at 0 and a tolerance for 
maximum change at any point of 0.0001, the number of iterations required is 


Value of ω:            1.05 1.10 1.15 1.20 1.25 1.30 
Number of iterations:  16   14   12   11   12   13 


a) Each interior point = 0.444. 
b) Values at interior points: 


0.211 0.312 0.342 0.312 
0.312 0.472 0.521 0.472 
0.342 0.521 0.577 0.521 
0.312 0.472 0.521 0.472 
0.211 0.312 0.342 0.312 


0 + 0 + 3d). + br) — 864, + 8 

36,, + 0+ 0+ d. — 86,2 + 8 
0 + dy, + 3d2. + by) — 86, +8 =0 

8 

8 

8 


+ 


3b, + by + 0 + by — 8b 

0 + dy, + 3d32 + 0 — 8d 3, + 

3ds; + by) +0 +0 — Bbq, + 

Same values as in Exercise 21(b). The number of iterations to achieve a change in any value 
less than 0.0001 is 


ω: 1.0 1.1 1.2 1.3 1.35 1.4 1.5 
No. of iterations: 13 1 10 5 10 11 13 

The calculated value for ω optimum is 1.33. The smallest number of iterations occurs with 
that value. 

There is fourfold symmetry. Values in the upper left quadrant are 

20.00 20.00 20.00 20.00 20.00 

50.30 59.21 63.16 64.83 65.30 

61.53 72.13 77.33 79.63 80.28 

Interior points: 


41.08 
19.06 35.17 
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14. 


15. 


(Utiiame symmery, mime poms are sufficient. 
Agr = 1, Caches: fom Ge come 10.45, 10.41, 10.27. 


ar=2 10.79, 90.78, 10.70. 
war=% 7.01, 7.06, 7.41. 


‘These 3s siiold symmetry, so cach node is the same, 33_33°_ 


Same answers as Exercise 9. The number of erations (one vertical and one horizontal 
is & Compare ts w 13 fermions by SOR. using opamem w. S.OR. with w = 
weguies 25 Derations. 


One can conclude that decrease of r is more effective than decrease of Ax m increasing 


Artz = 0.0568: —0.08390 0.00762 —O.01056 —0.01084 

Ara = 0.0936: 0.09738 6.01352 —0.01782 —0.01510 

Ara = 0.2340: 6.01326 0.02357 —O.N2757 0.02126 
Waluss at x = 15.12: 


= 0 i 2 3 - 
x“: 620.1 778.3 33.6 778.3 620.1 


Computing with Az = 2", 7 = 2 im the explict method, the madpoimt reaches 200°F mm 8764 
sec. 
Ader 4, 8, and 20 time steps. waluss ae 
~ 8 O24 O50 -082% 0 
_ 5 0.336 os 0336 6 
a 8 —86I5 O87 -065 0 
The errors are growing, so the method is unstable. 
8) Agee = 0.9595; the formule gives the same walus. 
) Age = —1-3509; the Gommolz sives —13514. The power method hes great 
wih this ma. 


CHAPTER? 


22. 




The equations are of the form 


se 
w= ry th + (1 = 6r\ur 5, 
‘a | 

where r = k At/cp(Ax)? and must be less than }. There are 125 equations to be solved 
at each time step (assuming no symmetry), and if r = $, 16 time steps are needed to reach 
t= 15.12. 

There are three sets of 125 equations, but each set can be decomposed into 25 subsets of 5 
equations, each subset being tridiagonal, so at most 1125 coefficients need to be stored; r 
can be chosen conveniently at unity, so only two time steps will suffice. 


The calculated frequency is 409.9 sec$^{-1}$. If $\Delta x = 70/4$, $\Delta t = 3.050 \times 10^{-4}$ sec. Eight time 
steps are required for a full cycle. This corresponds to a frequency of 409.84 sec$^{-1}$. 


Using $\Delta x = 0.1$, $\Delta t = 0.1$ (for r = 1), representative values are 


x:         0.1       0.3       0.5 
t = 0.1:   0.2939    0.7694    0.9511 
t = 0.7:  -0.1816   -0.4755   -0.5878 
t = 1.6:   0.0955    0.2500    0.3090 


These are the same as the analytical results. 
Representative values (compare these to Exercise 2(c)): 


x L/8 L/4 3L/8 L/2 
t= Ar —0,.00651 0.01140 0.01434 —0,01531 
r= 2Ar —0,01140 —0.02085 —0.02670 —0,02868 
r= 11Ar 0.01434 0.02670 0.03518 0.03810 


Compare these to Exercise 2(d): 


t= dtr 0.02344 0.4688 0.7031 0.6927 
t= 2A 0.2188 0.4375 0.4115 0.4062 
t= 1dr —0.2031 —0.1615 —0.1406 0.1302 


The errors expand (or contract) exactly as in the example in Section 9.3 or in Exercise 8. 
Whenever the boundary conditions are specified, the error is zero at that boundary. 


Some representative values: 


(0,25, 0.1021): 0.0186; (0.50, 0.0884): 0.0216; 
(0.75, 0,0791): 0.0144; 
(0.25, 0,2042): 0.0333; (0.50, 0.1768): 0.0408; 
(0.75, : 0.0269; 
0.0402; (0.50, 0.2652): 0.0546; 
0.0360. 


Comparison is awkward and would require two-way interpolation, but it is clear that the 
values do not agree. The values by finite differences are 20 to 50% larger. 

Beginning with points at ¢ = 0 where x = 0.2, 0.4, 0.6, and 0.8, intersections of character- 
istics are 


(0.2993, 0.1006), (0.4989, 0.1067), (0.6982, 0.1005), 
(0.3915, 0.1946), (0.5950, 0.2003). 

Computing along the characteristics gives u = 0.5986, 0.9978, 1.3964, 0.7830, and 1.1 
These are identical to the analytical solution, u = 2x. 


Representative values at points (x, y), with $\Delta t = 0.0018$: 


Point:        (0.5, 0.5)   (0.5, 1.0)   (1.5, 3.5) 
t = 0.0036      0.0625       0.1250       0.1458 
t = 0.0072     -0.0729      -0.1458      -0.4062 
t = 0.0144     -0.0911      -0.2135      -0.0078 


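From the normal equations: 

\[
a = \frac{N\sum xy - \sum x \sum y}{N\sum x^2 - \left(\sum x\right)^2}, \qquad
b = \frac{\sum x^2 \sum y - \sum xy \sum x}{N\sum x^2 - \left(\sum x\right)^2}.
\]

If we substitute $\bar{y} = \sum y/N$ and $\bar{x} = \sum x/N$ into $ax + b = y$, we find that 

\[
a\bar{x} + b = \bar{y}.
\]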


y = 1.908x + 0.025. 

€n § = 0.18395 + 9.6027 * 10737, § = 1.2019¢9.9027% 10-47, 
$y = 0.104x^2 + 1.183x + 0.992$. 

$f = -6.366 + 24.086x - 3.611x^2 + 0.122x^3$. 


When the degree is increased to 4, B does not decrease: 
Degree 1 2 a 4 
B: Sil 532 85 87 
For (0, 1): integral = —V1 — x7]!, = 0. 
1 
For (1, 1): integral = (-3vi =a + 5 sin"! x) 


For (0,0): integral = sin“! x], = 7. 
For (1, 2); (Use integration by parts) 
integral = —2x?V1 — x? + 4 fx V1 — x? de 
-2eV1 — x? - $V — 2], = 0. 
The error of the ninth-degree Maclaurin series is very small near the origin but increases 
rapidly to 0.04952 atx = +1. 
The error of P; has four maxima (counting the endpoints) and has a maximum error (near 
x = +0.4) of about 0.01975. This is much smaller than the maximum error of the ninth- 
degree expression. 
1 — (x/2) + («?/10) — (%3/120)° 
Maximum error is —0.00003 at x = 1, 
_ 0.851632 — 0.142027, + 1.1620 x 107 “Ty 
sins 1 + 5.18619 x 1037, 


c) R33 = 


Index 


Absolute error, 8, 45-46 
Accuracy of solutions, 124-132 
effect of pivoting, 125 
iterative improvement, 131~132 
residual, 128 
Adams’ method, 361-363 
Adams-Moulton method, 367-373, 383, 694 
convergence criterion, 375 
criterion of accuracy, 370, 375 
stability, 370 
Adaptive integration, 305-310 
A.D.I. method, 508-512, 569-572, 615 
convergence and stability, 572 
programs, 533-536, 583-587 
Aitken acceleration, 24 
Algorithm, 7 
adaptive integration, 309 
Bairstow's method, 60 
Fast Fourier Transform powers, 658 
Gauss-Seidel iteration, 135 
Gaussian elimination, 100 
halving the interval (bisection), 7 
Jacobi iteration, 134 
linear interpolation (regula falsi), 10 
LU decomposition, 109-110 
modified linear interpolation, 11, 53 
Muller's method, 19 
Newton's method, 17 


QR method, 445 

synthetic division, 30 

x = g(x) form (iteration), 23 
Alternating-direction implicit method, 508-512, 569-573 
program, 533-536, 583-586 
Approximation of functions, 639-658 

Chebyshev expansion, 641 

economized power series, 639-644 

Fast Fourier Transforms, 652-658 

Fourier series, 652 

Padé approximation, 645 

rational function, 644-651 
Auxiliary function, 185 


B-splines, 222-228 


end segments, 225 

properties, 225 
B-spline surface, 233 
Backward difference, 190 
Bairstow’s method, 31-34 

program, 61-63 


Bessel polynomial, 196-197, 199 
Bezier curves, 217-222 

Bezier polynomials, 218 

Bezier surface, 233-235 
Bisection method, 5 

Bolzano method, 5 




Calculus of variations, 426 


Design matrix, 631 

Determinant, 94, 117-118 
determinant, 118 

Diagonally dominant, 136, 484 


S.O.R. (successive over-relaxation), 487-489 
symmetry, 432 

three dimensions, 507 

triangular network, 502-503 


derivatives, 268-272, 688-689 
effect of grid size, 483 
integration, 287, 695-696 
interpolation, 185-186, 200-203 
linear systems, 124-132 
minimizing (of interpolation), 186, 201 
original data, 39, 376 
propagated errors, 40, 376-379 
round-off, 39, 376 

truncation, 39, 377 

wave equation, 595 

Euclidean norm, 122-124 

Euler-Lagrange equation, 427-428 

Euler's method, 352-356 
errors, 353, 360 
modified Euler method, 354, 381 
program, 389-390 

Explicit method, 547-555, 359 
instability, 548, 562-568 
program, 578-580 

Exponent overflow, 45 

Exponent underflow, 45 


FFT, 652-658 
bit reversing, 656 
program, 665-666 
Finite differences, 418-423, 594 
exact (for wave equation), 599 
Finite-element method, 430-433, 512-524, 573-576 
Floating-point arithmetic, 43-45 
Floating-point numbers, 41 
double precision, 44 
exponent overflow, 45 


Forward differences, 190 

Fourier coefficients, 653 
series, 635, 652 
transform, 654 

Functional, 426, 513 


Galerkin method, 433, 574 

Gauss polynomial, 196, 198 

Gaussian elimination, 95, 98-103 
algorithm, 100 
back-substitution, 97, 102 
LU decomposition, 99, 102, 106-113 
multiple righthand sides, 104 


multiple integration, 316 
Gauss-Jordan method, 103-106 
Gauss-Seidel method, 135-136 


over-relaxation, 140, 487 
program, 163-164 
Gear's method, 385 
Gerschgorin’s theorems, 443-444 
Global error, 353, 356, 377-379 
Gompertz relation, 629 
Graeffe’s method, 38 


Halving the interval (bisection), 5-9 
program, 64-66 
Hamming's method, 384 
Hartley transform, 658 
Hessenberg matrix, 444-445 
Homogeneous equation, 433 
Householder reflector, 445 
Hyperbolic equations, 592-615 
beginning the solution, 599-600 
finite-difference method, 594-596 
method of characteristics, 602-612 
PDECOL, 544, 698 
program, 616-619 
two dimensions, 613-615 


Identity matrix, 92-93 
Ill-conditioned, 99, 126-130, 183, 632 
Implicit method, 556 

stability and convergence, 562-568 
Improper integrals, 304 
IMSL, 52, 148, 544, 697, 699 

DGEAR, 385 

DVERK, 360 

FFTCC, 665 

LEQT2F, 143 

PDECOL, 544 

TWODEPEP, 473 

ZBRENT, 52 

ZRPOLY, 52 

ZSYSTM, 52 

ZSPOW, 148 
Inconsistent equations, 114-115, 128 
Indefinite integrals, 304-305 
Inherited error, 29 
Initial-value problems, 379 




Integration, 285-321 
Interpolating polynomials, 183-255 
Bessel, 196, 197 
bounds on error, 185-186 
divided differences, 186-189 
domain, 191 
error term, 185-186, 200-203 
Gauss, 196 
Lagrangian, 184-186 
Newton-Gregory, 192-196 
next-term rule (for error), 200, 202 
Stirling, 197, 198 
two dimensions, 229-232 
symbolic derivation, 203-205 
uniqueness, 189 
Inverse of a matrix, 94, 117-121 
eigenvalue, 441, 443 
solving equations, 120 
Inverse interpolation, 205-207 
by successive approximations, 206 
Inverse power method, 443 
Iterative improvement, 131-132 
Iterative methods, 132-141 
eigenvalues, 436-447 
Gauss-Seidel, 135-136 
Jacobi, 133-135 
nonlinear equations, 20-23, 141-143 
partial differential equations, 484-495 
relaxation, 137-141 
Iteration, 7 
ITPACK, 148 


Jacobi method, 133-135 
algorithm, 134 
convergence, 136 


Kirchhoff’s law, 88 


Lagrangian polynomial, 184-186 
error, 185-186 
Laplace’s equation, 476-489 
A.D.I. method, 508-512 
initial estimates, 486 
iterative method, 484-489 
program, 526-530 
Laplacian, 475 
pictorial operator, 479 
polar coordinates, 503 
three dimensions, 507 
Least squares, 625-636 
optimum-degree polynomial, 634 
polynomial, 629-634 
programs, 660-665, 667-674 
Liebmann’s method, 484, 490 
Linear interpolation, 9-14 
modified linear interpolation, 11 
programs, 57-59, 66-68 
Linear systems, 95-121, 132-141 
Local error, 353, 355, 369 
Lozenge diagrams, 274-278 
LU decomposition, 99, 106-113, 120, 509, 557 
algorithm, 100, 109 
matrix inverse, 120-121 
pivoting, 92, 109 
program, 153, 158 


Matrices, 89-95 
augmented, 96-98 
characteristic polynomial, 437 
characteristic value, 434 
condition number, 124-130 
determinant, 94-95, 117-118 
diagonal, 92 
eigenvalue, 434, 437-448 
eigenvector, 437, 440 
Hessenberg, 444-445 
Householder reflector, 445 
identity, 92-93 
inverse, 94, 117-121 
lower triangular, 93 
minors, 94-95 
“nearly singular,” 126 
nonsingular, 115, 117 
norms, 121-124 
properties, 494 
rank, 115 
singular, 113-115, 117 
sparse, 508 
trace, 93 
transpose, 93 
tridiagonal, 93 
upper triangular, 93 
Matrix multiplication, 90-91 
conformable, 91 
Matrix notation, 89-95 
Maximum likelihood, 627 
Mean-value theorems, 681, 682 
Milne’s method, 364-367, 694 
convergence criterion, 374-375 
criterion of accuracy, 365, 375 
instability, 365-367, 370-373 
Minimax principle, 627, 648, 650 
Muller's method, 18-20 
program, 73-76 
Multiple integrals, 312-321 
errors, 317-318 
Newton-Cotes formulas, 312, 314 
program, 332-334 
variable limits, 318-321 
Multiple righthand sides, 104-105 
Multistep methods, 361-373, 384-385 


NAG, 699 

Nested multiplication, 30, 648 

Newton-Cotes formulas, 285-288 
multiple integration, 312-317 


Newton-Gregory polynomial, 192-196, 363, 368 


program, 239-240 
Newton’s method, 15-17 
convergence, 16, 26-27 
divergence, 17 
polynomials, 27 
programs, 69-71, 165-169 
system of equations, 141-147 
Next-term rule (for error), 202, 649 
“Noisy” data, 282 
Nonconformable, 91 
Nonlinear equations, 1-76, 141-147 
Bairstow’s method, 31-34 
bisection method, 5-8 
halving the interval, 5-8 
linear interpolation, 9-14 
modified linear interpolation, 11 
Muller's method, 18-20 
Newton's method, 15-17 
programs, 52-76, 165-169 
quotient-difference method, 34-38 
secant method, 13-14 
starting values, 8, 13, 17 
systems, 141-147 
x = g(x) form (iteration), 20 
Normal equations, 628, 630 
Norms, 121-124 
Euclidean, 122-124 
Frobenius, 123 
maximum column-sum, 123 
maximum row-sum, 123 
p-norm, 124 
spectral, 123 
Numerical differentiation, 266-285 
central differences, 271, 274, 277 
error, 267-272, 277 
extrapolation, 278-281 
formulas, 283-284 
higher derivatives, 272-274 
lozenge diagrams, 274-276 
“noisy” data, 282 
optimum value of h, 281-283 
programs, 323, 324 
round-off error, 281, 282-283 
splines, 310-312 
undetermined coefficients, 684-689 
unstable process, 282 
Numerical integration, 285-321 
adaptive, 305-310 
error, 286-296 
extrapolation, 291-293, 294-295 
formulas, 297 




Gaussian quadrature, 299-303 
improper integrals, 304-305 
indefinite integrals, 304-305 
multiple integrals, 312-321 
Newton-Cotes formulas, 285-288 
programs, 325-334 

Romberg integration, 291-293 
Simpson's rules, 293-297 
splines, 310-312 

symbolic derivation, 297-299 
trapezoidal rule, 288-290 
undetermined coefficients, 690-696 


Open interval, 680 

Order of h^n, 203, 689 

Orthogonal polynomial, 302, 631, 638 
Over-relaxation, 137, 139-141, 487 


Padé approximation, 645 
comparison to Chebyshev series, 651 
comparison to Maclaurin series, 647 
error, 649 

Parabolic equations, 544-576 
A.D.I. method, 568-573 
Crank-Nicolson method, 555-558 
derivative boundary conditions, 558-561 
explicit method, 547-555, 576 
finite-element method, 573-576 
implicit method, 556 
PDECOL, 544 
polar coordinates, 572-573 
programs, 577-587 
spherical coordinates, 573 
two and three dimensions, 568-573 

Parasitic term, 367 

Partial differential equations, 472-619 
classification, 473, 594 

Periodic functions, 652 

Pivoting, 98 
complete, 98 
effect on accuracy, 124-125 
LU method, 112 
partial, 98 

p-norm, 124 

Poisson's equation, 476, 489-492 
derivative boundary conditions, 495-498 
program, 530-542 

Polynomials, 27-38 

Power method, 439-443 
inverse power method, 442-443 
program, 455-456 
shifted, 442 
smallest eigenvalue, 441 

Power spectrum, 654 


Predictor-corrector methods, 354, 364-367, 373 
Programs, A.D.I. method, 533-536, 583-586 


LU decomposition, 153-156, 158-163 
modified linear interpolation, 52-59 
Muller’s method, 73-76 

Newton's method, 69-71, 165-169 


tridiagonal, 157-158 
wave equation, 616-619 
x = g(x) method, 71-73 
Propagated error, 40, 376-379, 564-568, 601 


QD method, 34-38 


Quadratic convergence, 16, 26 
Quasilinear equation, 602, 610 


optimum value, 141, 487-489, 492-495 


Relaxation method, 137-141, 487-489, 492-495 


Romberg integration, 291-293 
error, 317-318 
program, 326-328 


Bonconvergence, 
Shape functions, 431, 514, 574 
Shooting method, 412-417 


Simplex method, 114 

Simpson’s rules, 293-297, 600, 692 
error, 293-296 
program, 325 

Singular matrix, 113-115, 117 
properties, 117 

Software libraries, 697-699 
personal computers, 699 

S.O.R., 487, 491 
relation to eigenvalues, 222-228 

Sources and sinks, 476 

Sparse matrix, 508 

Spectral norm, 123 

Splines, 207-223 
B-splines, 222-228 
Bezier curves, 217-222 
comparison to Simpson's rule, 312 
derivatives, 310-311 


Symbolic methods, 203-205, 272, 297-299 
Synthetic division, 27, 29-30 
change of variable, 37 
quadratic factors, 31-34 


Trapezoidal rule, 288-290, 691-692 
error, 289-290 

Triangle inequality, 122 

Triangular matrix, 93 

Tridiagonal matrix, 93, 212, 420, 572 
program, 157-158 

Truncation error, 39, 377 

TWODEPEP, 473, 698 

Two-dimensional interpolation, 229-233 
minimizing error, 231 
region of fit, 231 


Under-relaxation, 139-140 
Undetermined coefficients, 298, 684-696 
derivatives, 684-688 
integration, 690-696 
Unique solution, 113, 117 
Unit vector, 93, 119 
Unsteady state, 544 




Vectors, 91 
column vector, 91 
inner product, 92 
linearly dependent, 116, 127 
norms, 121, 123 
row vector, 91 
scalar product, 92 
unit vector, 93, 119 


Wave equation, 592-596 
d'Alembert solution, 596-597 
PDECOL, 544 
program, 616-619 
two dimensions, 613-615 


x = g(x) form, 20-25 
acceleration, 24 
convergence criterion, 23 
divergences 
program, 7 